Invited Paper


The present paper reviews recent developments in the computation
of motion and structure of objects in a scene from a sequence of
images. We highlight two distinct paradigms: i) the feature-based
approach and ii) the optical flow based approach. The comparative
merits/demerits of these approaches are discussed. The current
status of research in these areas is reviewed and future research
directions are indicated.

I.  Introduction 

The ability to discern objects, ascertain their motion, and
navigate in three-dimensional space through the use of vision is
almost universal among animals. Incorporating such vision in
machines is ostensibly a straightforward task given the widespread
availability of microcomputers, digitizing cards, and solid-state
cameras. Although it is fairly easy and inexpensive to assemble a
computer vision system, it has proved surprisingly difficult to
achieve a vision capability in machines, even to a limited degree.
This is not to imply that we are not using all sorts of vision
systems and motion detectors in a variety of applications. However,
the ease with which humans detect motion and navigate around
objects, and the difficulty of duplicating these capabilities in
machines, have recently led to major efforts by computer engineers
and scientists to understand vision in man and machine. These
efforts are in addition to, and perhaps complement, current and
earlier endeavors at understanding human vision and motion by
psychologists and physiologists.

Broadly speaking, there are two groups of scientists studying
vision. One group is studying human/animal vision with the goal of
understanding the operation of biological vision systems, including
their limitations and diversity. The scientists in this group
include neurophysiologists, psychophysicists, and physicians. The
second group of scientists includes computer scientists and
engineers conducting research in computer vision with the objective
of developing vision systems. Vision systems with the ability to
navigate, recognize, and track objects and estimate their speed and
direction are the ultimate goals of the latter research. The
knowledge and results of research in neurophysiology and
psychophysics have influenced the design of vision systems by
engineers and scientists. At the same time, results in computer
vision have provided a framework for modeling biological vision.
Such cross-fertilization of ideas will continue to yield better
models for biological and machine vision systems.

There is a long list of applications motivating a strong interest
in sensing, interpretation, and description of motion from a
sequence or a collection of images. The automatic tracking and
possible ticketing of speeding vehicles on a highway is of interest
to traffic engineers and law enforcement officers. The automatic
recognition, tracking, and possible destruction of targets is of
immense interest to the department of defense of every country. The
computation, characterization, and understanding of human motion in
dancing, athletics, and pilot training are important to several
diverse disciplines. The analysis of scintigraphic image sequences
of the human heart is of interest in assessing motility of the heart
in diagnosis and supervision of patients after heart surgery.
Satellite imagery provides the meteorologist with an opportunity for
interpretation and prediction of atmospheric processes through the
estimation of shape and motion parameters of atmospheric
disturbances. The bandwidth reduction achievable through the
estimation of motion allows for compression of image sequences for
efficient transmission. The above examples are indicative of the
diversity of applications where the computation of motion from a
sequence of images is of critical importance.

This broad interest in the interpretation of motion from a sequence
of images has been evident since the first workshop on motion in
Philadelphia in 1979 [1]. Since that workshop, several additional
meetings and special issues of various journals have contributed to
the exchange of ideas and the dissemination of results. In addition,
there have been several sessions on motion and related issues at
meetings such as the IEEE Computer Society Computer Vision and
Pattern Recognition Conference and conferences of other societies
interested in vision. The list of workshops and special issues
devoted exclusively to motion and time-varying imagery includes
three special issues [2]-[4], two books [5], [6], a NATO Advanced
Study Institute [7], an ACM workshop [8], a European meeting on
time-varying imagery [9], and a host of survey papers [10]-[15]. The
breadth and depth of interest is indicated by the table of contents
of the book published to document the proceedings of the NATO-ASI
[16]. However, this list is incomplete at best. The IEEE Computer
Society workshop at Kiawah Island [17] and the Second International
Conference in Italy [18] are indications of the broad interest in
motion at this time. The recent two-volume collection of papers in
the reprint series [19] published by IEEE Computer Society includes
a section on Image Sequence Analysis containing nine papers. The
recent book edited by Martin and Aggarwal entitled Motion
Understanding: Robot and Human Vision [20] gives eleven papers
detailing recent developments in this area.

The above brief chronology documents the contributions from a
computer vision perspective. It is not the intention of the present
review to slight the earlier pioneering works of psychologists and
other scientists. In particular, the kinetic depth effect
demonstrated by Wallach and O'Connell [21] through the use of wire
frame objects, and similar effects shown by Gibson [22] in his
translucent sheet experiments, Ullman [23] in his rotating cylinders
experiment, and Johansson [24]-[28] are important contributions in
the area of psychophysics of motion perception. In the same vein,
the contributions of Hubel and Wiesel [29] in demonstrating the
existence of specialized cortical cells tuned to the detection of
motion are seminal contributions in neurophysiology. The present
review, however, is only aimed at the computer vision inspired
contributions to the study of motion. A more balanced review of the
recent contributions in both psychophysics of vision and machine
vision is found in [20].

In this paper we do not present an exhaustive compendium of recent
research in the computation of motion and structure from sequences
of images; instead we list some of the important work done and
provide a flavor of the approaches that have been developed.

II.  Methodologies  for  Motion  Estimation 

The relative motion between objects in a scene and a
camera gives rise to the apparent motion of objects in a
sequence  of  images.  This  motion  may  be  characterized  by 
observing  the  apparent  motion  of  a  discrete  set  of  features 
or  brightness  patterns  in  the  images.  The  objective  of  the 
analysis  of  a  sequence  of  images  is  the  derivation  of  the 
motion  of  the  objects  in  the  scene  through  the  analysis  of 
the  motion  of  features  or  brightness  patterns  associated 
with  objects  in  the  sequence  of  images. 

Two distinct approaches have been developed for the computation of
motion from image sequences. The first of these is based on
extracting a set of relatively sparse, but highly discriminatory,
two-dimensional features in the images corresponding to
three-dimensional object features in the scene, such as corners,
occluding boundaries of surfaces, and boundaries demarcating changes
in surface reflectivity. Such points, lines, and/or curves are
extracted from each image. Inter-frame correspondence is then
established between these features. Constraints are formulated based
on assumptions such as rigid body motion, i.e., the 3-D distance
between two features on a rigid body remains the same after
object/camera motion. Such constraints usually result in a system of
nonlinear equations. The observed displacements of the 2-D image
features are used to solve these equations, leading ultimately to
the computation of motion parameters of objects in the scene.

The other approach is based on computing the optic flow, or the
two-dimensional field of instantaneous velocities of brightness
values (gray levels) in the image plane. Instead of considering
temporal changes in image brightness values in computing the optic
flow field, it is also possible to consider temporal changes in
values that are the result of applying various local operators such
as contrast, entropy, and spatial derivatives to the image
brightness values. In either case, a relatively dense flow field is
estimated, usually at every pixel in the image. The optic flow is
then used in conjunction with added constraints or information
regarding the scene to compute the actual three-dimensional relative
velocities between scene objects and camera.

A task that is closely related to the estimation of motion is the
estimation of the structure of the imaged scene. In the case of the
optic flow method, this consists of grouping pixels corresponding to
distinct objects into separate regions, i.e., segmenting the optic
flow map, and then computing the three-dimensional coordinates of
surface points in the scene corresponding to each pixel in the image
at which the flow is computed. In the case of the feature-based
analysis, computing structure corresponds to forming groups of image
features for each object in the scene and then computing the 3-D
coordinates of each object feature associated with each image
feature.

Although structure may be computed independent of motion, e.g., via
stereopsis, the former process can benefit from the estimated
motion. Knowledge of motion parameters for features/regions can aid
segmentation of image features/regions corresponding to distinct
objects. In stereopsis, knowledge of object motion can facilitate
establishment of feature correspondence within a pair of stereo
images, thus aiding the determination of structure. Image regions
with different apparent 2-D motions can be considered to correspond
to distinct objects. Psychological research has collected enough
evidence to support the belief that the process of establishing
correspondence and the process of estimating structure and motion
are closely interwoven in the human visual mechanism. Indeed, Ullman
has shown that apparent motion is a clue used by the human visual
system for computing scene structure [6]. This close relationship
between the estimation of structure and the estimation of motion has
prompted many researchers to address both tasks as a combined
problem. In this paper we discuss the combined task of computing
structure and motion from image sequences.

In the following sections we discuss in greater detail the
fundamental principles underlying the two distinct methodologies for
computing 3-D motion from apparent motion. The basic mathematical
formulations are introduced and discussed. In Section III we discuss
the feature-based method for estimation of motion from a sequence of
monocular images. In Section IV we discuss the optic flow method for
sequences of monocular images. Section V discusses the relative
merits and demerits of these two approaches. The two approaches
outlined above allow for the estimation of motion without requiring
that scene structure be known a priori. The use of stereopsis allows
for the estimation of depth, i.e., the distance from the sensor to
the objects. The additional information available greatly reduces
the complexity of motion estimation. The variety of ways in which
stereopsis can be used to facilitate the computation of motion is
outlined in Section VI. Finally, Section VII concludes this paper
with a few closing remarks.

III. Feature-Based Motion Estimation from Monocular
Image Sequences

In this section, we discuss the feature-based approach to estimating
motion from a sequence of images gathered by a single camera. A
mathematical formulation is presented and variations of this
formulation are discussed. The discussion focuses on the estimation
of both motion and structure. No distinction is made between the
situations where a) the camera is moving and the imaged scene is
stationary, b) the camera is stationary while the imaged objects are
in motion, or c) both camera and imaged objects are in motion. What
is computed is the relative position and motion between the camera
and the imaged scene. In the following discussion it is assumed that
image features, such as points and lines, have been extracted from
each image and inter-frame correspondence has already been
established between the features.

We present below three approaches to feature-based analysis of
monocular image sequences. The first of these is the direct
formulation in which rigid body motion is assumed. In this
formulation the rigidity constraint is manifest in there being
single rotation and translation matrices for all observables. In the
second approach rigidity is explicitly invoked, with the formulation
being based on preserving rigidity, e.g., preserving the angle
between two intersecting 3-D lines lying on a rigid object. These
two schemes use two or three views to estimate structure and motion.
A third approach consists of using a long sequence of monocular
images. A brief description of the salient features of each approach
is presented.

A.  Direct  Formulations 

An orthographic imaging model was used by Ullman [6], [23] to
estimate the structure and motion of an object undergoing rigid
motion. The position and motion of four noncoplanar points in space
were recovered from three distinct orthographic projections of these
points. The formulation is as follows. Let $O$, $A$, $B$, and $C$ be
the four points. The orthographic projections of these points onto
three distinct planes $\Pi_1$, $\Pi_2$, $\Pi_3$ are given and the
3-D configuration of these points is to be determined. A fixed
coordinate system with origin at $O$ is chosen. Let $\mathbf{a}$,
$\mathbf{b}$, $\mathbf{c}$ be the vectors from $O$ to $A$, $B$, and
$C$, respectively. Let each image have a coordinate system with its
origin at the projection of $O$, and its axes along the directions
$\mathbf{p}_i$, $\mathbf{q}_i$. Note that $\mathbf{p}_i$ and
$\mathbf{q}_i$ are orthogonal unit vectors on $\Pi_i$. Let the image
coordinates of $(A, B, C)$ on $\Pi_i$ be $(x_{a_i}, y_{a_i})$,
$(x_{b_i}, y_{b_i})$, $(x_{c_i}, y_{c_i})$, and let
$\mathbf{u}_{ij}$ be the unit vector along the intersection of
$\Pi_i$ and $\Pi_j$.

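The image coordinates are given by the dot products

$$x_{a_i} = \mathbf{a} \cdot \mathbf{p}_i, \quad y_{a_i} = \mathbf{a} \cdot \mathbf{q}_i, \quad x_{b_i} = \mathbf{b} \cdot \mathbf{p}_i,$$

$$y_{b_i} = \mathbf{b} \cdot \mathbf{q}_i, \quad x_{c_i} = \mathbf{c} \cdot \mathbf{p}_i, \quad y_{c_i} = \mathbf{c} \cdot \mathbf{q}_i.$$

The unit vector $\mathbf{u}_{ij}$ lies on $\Pi_i$, which is spanned
by $(\mathbf{p}_i, \mathbf{q}_i)$, hence

$$\mathbf{u}_{ij} = \alpha_{ij} \mathbf{p}_i + \beta_{ij} \mathbf{q}_i, \quad \text{where } \alpha_{ij}^2 + \beta_{ij}^2 = 1.$$

The unit vector $\mathbf{u}_{ij}$ also lies on $\Pi_j$, which is
spanned by $(\mathbf{p}_j, \mathbf{q}_j)$, hence

$$\mathbf{u}_{ij} = \gamma_{ij} \mathbf{p}_j + \delta_{ij} \mathbf{q}_j, \quad \text{where } \gamma_{ij}^2 + \delta_{ij}^2 = 1.$$

From the latter two equations we obtain

$$\alpha_{ij} \mathbf{p}_i + \beta_{ij} \mathbf{q}_i = \gamma_{ij} \mathbf{p}_j + \delta_{ij} \mathbf{q}_j$$

and taking the scalar product of this equation with $\mathbf{a}$,
$\mathbf{b}$, and $\mathbf{c}$ we get:

$$\alpha_{ij} x_{a_i} + \beta_{ij} y_{a_i} = \gamma_{ij} x_{a_j} + \delta_{ij} y_{a_j}$$

$$\alpha_{ij} x_{b_i} + \beta_{ij} y_{b_i} = \gamma_{ij} x_{b_j} + \delta_{ij} y_{b_j}$$

$$\alpha_{ij} x_{c_i} + \beta_{ij} y_{c_i} = \gamma_{ij} x_{c_j} + \delta_{ij} y_{c_j}$$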

These equations are linearly independent [6] and possess two
solutions that are equal in magnitude but have opposite sign.
Choosing one of these solutions, the vectors $\mathbf{u}_{ij}$ are
determined. The distances
$d_1 = \|\mathbf{u}_{12} - \mathbf{u}_{13}\|$,
$d_2 = \|\mathbf{u}_{12} - \mathbf{u}_{23}\|$, and
$d_3 = \|\mathbf{u}_{13} - \mathbf{u}_{23}\|$ are then computed.
When no two vectors $\mathbf{u}_{ij}$ are equal, then $d_i \neq 0$
and a unique triangle with sides $d_1$, $d_2$, and $d_3$ is
specified. Consider the tetrahedron formed by this triangle and the
origin $O$, with the vertices of the triangle being placed at unit
distance from the origin $O$. From the projections of $A$, $B$, and
$C$ on the three planes (images) a unique 3-D configuration is
easily computed. In the degenerate case, i.e., when two of the
$\mathbf{u}_{ij}$ are identical, straightforward trigonometric
considerations provide recovery of the structure and motion of the
body [23].
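As an illustration of the linear step in the formulation above, the following minimal sketch (not taken from [6] or [23]) builds the three equations for one pair of views and recovers the direction of the intersection line. The function names, the test points, and the use of a singular value decomposition to find the null vector are assumptions made for this example only.

```python
import numpy as np

def orthographic_coords(points, p, q):
    """Image coordinates of 3-D vectors under orthographic projection:
    x = v . p and y = v . q for orthonormal image axes p and q."""
    return points @ p, points @ q

def intersection_coefficients(points, p_i, q_i, p_j, q_j):
    """Solve alpha*x_i + beta*y_i = gamma*x_j + delta*y_j for the rows of
    `points` (the vectors a, b, c), scaled so alpha^2 + beta^2 = 1.
    The overall sign is arbitrary, matching the two solutions in the text."""
    x_i, y_i = orthographic_coords(points, p_i, q_i)
    x_j, y_j = orthographic_coords(points, p_j, q_j)
    M = np.column_stack([x_i, y_i, -x_j, -y_j])   # 3 x 4 homogeneous system
    _, _, vt = np.linalg.svd(M)
    coeffs = vt[-1]                               # null vector of M
    return coeffs / np.hypot(coeffs[0], coeffs[1])

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# Hypothetical data: vectors a, b, c from O, and two image planes whose
# axes are the first two rows of a rotation matrix.
abc = np.array([[1.0, 0.2, 0.3], [0.1, 1.0, -0.4], [-0.3, 0.5, 1.0]])
p1, q1 = np.eye(3)[0], np.eye(3)[1]
view2 = rot_y(0.4) @ rot_x(0.3)
p2, q2 = view2[0], view2[1]

alpha, beta, gamma, delta = intersection_coefficients(abc, p1, q1, p2, q2)
u_from_view1 = alpha * p1 + beta * q1
u_from_view2 = gamma * p2 + delta * q2
```

In this sketch the image axes are used only to synthesize the image coordinates and to check the answer; the linear system itself involves image measurements alone.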

Although the parallel projection model is adequate in some
situations, it is not appropriate for most real-world applications,
which mandate the use of perspective projection. The use of the
perspective transformation substantially increases the complexity of
the problem. Roach and Aggarwal [30], [31] were among the first to
compute structure and motion from images via the perspective imaging
transformation. A scenario consisting of a static scene and a moving
camera was assumed. The goal was to investigate whether it would be
possible to determine the position of the points in space and the
movement (translation and rotation) of the camera.

The equations that relate the three-dimensional coordinates of a
point $(X, Y, Z)$ and its image plane coordinates $(x, y)$ are

$$x = F \, \frac{a_{11}(X - X_0) + a_{12}(Y - Y_0) + a_{13}(Z - Z_0)}{a_{31}(X - X_0) + a_{32}(Y - Y_0) + a_{33}(Z - Z_0)}$$

$$y = F \, \frac{a_{21}(X - X_0) + a_{22}(Y - Y_0) + a_{23}(Z - Z_0)}{a_{31}(X - X_0) + a_{32}(Y - Y_0) + a_{33}(Z - Z_0)}.$$

Here $F$ is the focal length, $(X_0, Y_0, Z_0)$ is the projection
center, and $a_{11}, a_{12}, \ldots, a_{33}$ are functions of
$(\theta, \phi, \psi)$, the orientation of the camera with respect
to the global reference system.
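The projection equations above can be exercised with a short sketch. The text does not state how the coefficients depend on the three orientation angles, so the Euler-angle convention used in `rotation_from_angles` below is purely an assumption for illustration; the function names are likewise hypothetical.

```python
import numpy as np

def rotation_from_angles(theta, phi, psi):
    """Orientation matrix from three angles.
    Assumed convention (not from the source): R = Rz(psi) Ry(phi) Rx(theta)."""
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(phi), np.sin(phi)
    cs, ss = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cs, -ss, 0], [ss, cs, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(point, center, a, F):
    """Perspective projection: rows 1 and 2 of `a` form the numerators,
    row 3 forms the common denominator, as in the equations above."""
    d = a @ (np.asarray(point, dtype=float) - np.asarray(center, dtype=float))
    return F * d[0] / d[2], F * d[1] / d[2]

# Camera at the origin with no rotation: (X, Y, Z) maps to (F X/Z, F Y/Z).
x, y = project([2.0, 4.0, 8.0], [0.0, 0.0, 0.0], np.eye(3), F=1.0)

# A rotated and displaced camera, using the assumed angle convention.
a = rotation_from_angles(0.1, -0.2, 0.3)
x_rot, y_rot = project([2.0, 4.0, 8.0], [0.5, 0.0, -1.0], a, F=1.0)
```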

Roach and Aggarwal showed that five points in two views are needed
to recover these parameters [30], [31]. They related the number of
points and the number of equations available for the solution of 3-D
coordinates and motion parameters as follows: The global coordinates
of each point are unknown, so the five points produce 15 variables.
The camera position and orientation parameters ($X_0$, $Y_0$, $Z_0$,
$\theta$, $\phi$, and $\psi$) in two views contribute another 12
variables, yielding a total of 27 variables. Each 3-D point produces
two projection equations per camera position, thus forming a total
of 20 nonlinear equations. To make the number of equations equal the
number of unknowns, seven variables must be known or specified a
priori. This is achieved by choosing the six camera parameters of
the first view to be zero and setting the Z-component of one of the
five points to an arbitrary positive constant to fix the scaling
factor. The reason for fixing one variable as the scaling constant
is that under the given camera/object constraints the information
embedded in every image sequence is inherently insufficient for
determining the correct scale. For example, the observed projected
motion of an object moving in space can be reproduced by another
object which is twice as large, twice as far away from the camera,
translating twice as fast, and rotating with the same speed around
an axis of the same orientation as the former object. In general,
the information on the absolute distance of the object from the
viewer is lost in the image formation process. Therefore,
arbitrarily setting the scale is not unreasonable in finding the
solution for the structure and motion parameters.
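The counting argument and the scale ambiguity described above can be checked numerically. The sketch below is illustrative only: it assumes a camera at the origin with unit focal length, and the point set, rotation, and translation are hypothetical values chosen for the example.

```python
import numpy as np

def count_unknowns(n_points=5, n_views=2):
    """Bookkeeping from the text: variables, equations, remaining unknowns."""
    variables = 3 * n_points + 6 * n_views   # 15 point coords + 12 camera params
    equations = 2 * n_points * n_views       # two projections per point per view
    fixed = 6 + 1                            # first-view pose + one Z-component
    return variables, equations, variables - fixed

def project(points, F=1.0):
    """Perspective projection with the camera at the origin: (F X/Z, F Y/Z)."""
    points = np.asarray(points, dtype=float)
    return F * points[:, :2] / points[:, 2:3]

# Hypothetical rigid motion: rotation about the Y axis plus a translation.
angle = 0.3
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
T = np.array([0.2, -0.1, 0.5])
P = np.array([[0.5, 0.2, 4.0], [-0.3, 0.4, 5.0], [0.1, -0.6, 6.0]])

s = 2.0  # object twice as large, twice as far away, translating twice as fast
small_before, small_after = project(P), project(P @ R.T + T)
large_before, large_after = project(s * P), project((s * P) @ R.T + s * T)
```

Both objects produce identical image coordinates before and after the motion, which is why one variable must be fixed to set the scale.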

An iterative finite difference Levenberg-Marquardt algorithm was
used to solve these 18 nonlinear equations (after fixing the scale
factor, two of the 20 nonlinear equations have no unknown variables
in them). For noise-free simulations, the methods typically
converged to the correct answer within 15 seconds on a Cyber 170/50
and hence are reasonably efficient. If noise is introduced into the
point positions in the image plane, a considerably overdetermined
system of equations is needed to attain good accuracy of the
results. Two views of 12 or even 15 points, or three views of seven
or eight points, are usually needed in the noisy cases.

Unlike Roach and Aggarwal [30], [31], who solved for the motion
parameters through a single system of equations, thus creating a
large search space, Nagel [32] proposed a technique which reduces
the dimension of the search space through the elimination of unknown
variables. The important observation made by Nagel was that the
translation vector can be eliminated and the rotation matrix can be
solved for separately. A rotation matrix is completely specified by
three parameters, namely the orientation of the rotation axis and
the rotation angle around this axis. It is shown that if
measurements of five points in two views are available, then three
equations can be written and the three rotation parameters can be
solved for separately from the translation parameters. The distance
of the configuration of points from the viewer is arbitrarily fixed
and the translation vector can then be determined.
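To make the three-parameter description of a rotation concrete, the sketch below constructs a rotation matrix from an axis direction (two parameters, since the axis is a unit vector) and an angle, using the standard Rodrigues formula. It illustrates the parametrization only and is not Nagel's elimination procedure; the function name and numerical values are assumptions.

```python
import numpy as np

def rotation_from_axis_angle(axis, angle):
    """Rodrigues formula: R = I + sin(angle) K + (1 - cos(angle)) K^2,
    where K is the skew-symmetric matrix of the unit rotation axis."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    K = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

axis = np.array([1.0, 2.0, 2.0]) / 3.0   # unit vector: two free parameters
angle = 0.4                              # third parameter
R = rotation_from_axis_angle(axis, angle)
```

The axis is left unchanged by `R`, and the trace of `R` equals `1 + 2 cos(angle)`, so both the axis and the angle can be read back from the matrix.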

Tsai and Huang [33]-[35] proposed a method to find the motion of a
planar surface patch from 2-D perspective views. The algorithm
consists of two steps: First, a set of eight "pure parameters" is
defined. These parameters can be determined uniquely from two
successive image frames by solving a set of linear equations. Then,
the actual motion parameters are determined from these eight "pure
parameters" by solving a sixth-order polynomial.

By exploiting the constraints of projective geometry and rigid
motion, equations can be written to relate the coordinates of image
points in the two frames for points on a planar surface patch
$AX + BY + CZ = 1$, where $A$, $B$, and $C$ are the structure
parameters. The mapping from the $(x, y)$ space to the $(x', y')$
space (from one image to the next image) is given by

$$x' = \frac{a_1 x + a_2 y + a_3}{a_7 x + a_8 y + 1}, \qquad y' = \frac{a_4 x + a_5 y + a_6}{a_7 x + a_8 y + 1}$$

where $a_1$ through $a_8$ are the eight "pure parameters" and can be
expressed in terms of the focal length, the structure parameters
$(A, B, C)$, and the motion parameters $N_x$, $N_y$, $N_z$,
$\theta$, $T_x$, $T_y$, and $T_z$ ($N$ specifies the rotation axis,
$\theta$ is the rotational angle, and $T$ is the translational
vector). For a particular set of pure parameters, the above equation
represents a mapping from $(x, y)$ space to $(x', y')$ space. A set
of linear equations is solved for these eight pure parameters.
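The linear step can be sketched as follows. Multiplying the mapping through by its denominator gives two equations per point correspondence that are linear in the eight pure parameters, so four or more correspondences suffice in general. The least-squares solver, the function names, and the sample values below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def apply_pure_parameters(a, xy):
    """Map (x, y) to (x', y') with eight pure parameters a[0..7] = a1..a8."""
    x, y = xy[:, 0], xy[:, 1]
    w = a[6] * x + a[7] * y + 1.0
    return np.column_stack([(a[0] * x + a[1] * y + a[2]) / w,
                            (a[3] * x + a[4] * y + a[5]) / w])

def estimate_pure_parameters(xy, xy_prime):
    """Solve the linear system obtained by clearing the denominator:
    a1 x + a2 y + a3 - a7 x x' - a8 y x' = x'
    a4 x + a5 y + a6 - a7 x y' - a8 y y' = y'"""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(xy, xy_prime):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -x * xp, -y * xp])
        rhs.append(xp)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -x * yp, -y * yp])
        rhs.append(yp)
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return solution

# Hypothetical parameters and image points, used only to exercise the code.
a_true = np.array([1.02, 0.03, 0.10, -0.02, 0.98, -0.05, 0.01, 0.02])
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])
xy_prime = apply_pure_parameters(a_true, xy)
a_est = estimate_pure_parameters(xy, xy_prime)
```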

After the eight pure parameters are obtained, the structure and
motion parameters can be determined. Here, the Z component of the
translation vector is arbitrarily chosen to fix the scale. After a
series of manipulations, it is possible to get a sixth-order
polynomial equation in terms of only one of the variables,
$T'_x = T_x / T_z$. $T'_x$ is solved for first and then all the
remaining structure and motion parameters can be easily obtained.
Although potentially six real roots may result from solving a
sixth-order polynomial, the authors reported that, aside from a
scale factor for the translation parameters, the number of real
solutions never exceeded two in their simulation.

Later, Tsai and Huang [36] investigated the problem of a curved
surface patch in motion. Two main results were established
concerning the existence and uniqueness of the solutions. An E
matrix was specified as E = TR, where T is the translation and R is
the rotation. Given the image correspondences of eight object points
in general positions, the E matrix can be determined uniquely by
solving eight linear equations. Furthermore, the actual 3-D motion
parameters can be determined uniquely given E, and can be computed
by taking the singular value decomposition of E without having to
solve nonlinear equations. Detailed proofs of these claims are
presented by the authors [36]. Although the approach results in the
solution of a set of linear equations, the system is highly
sensitive to noise and especially to perturbations of image
coordinates. Longuet-Higgins [37], [38] worked independently to
obtain results similar to those described above. He derived the E
matrix and presented a method to recover R and T from E using tensor
and vector analysis.
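A minimal numerical sketch of the linear estimation of E from eight correspondences is given below. It assumes that T in E = TR stands for the skew-symmetric matrix of the translation vector, that the second view is related to the first by X' = RX + T, and that the focal length is one. The homogeneous system is solved here through a null vector obtained by singular value decomposition, a substitute for fixing one element and solving eight linear equations; the random test data are hypothetical.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix such that skew(t) @ v equals cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def estimate_e_matrix(x1, x2):
    """Estimate E (up to scale and sign) from n >= 8 correspondences so
    that [x2, 1] E [x1, 1]^T = 0 holds for every pair of image points."""
    h1 = np.column_stack([x1, np.ones(len(x1))])
    h2 = np.column_stack([x2, np.ones(len(x2))])
    A = np.einsum('ni,nj->nij', h2, h1).reshape(len(x1), 9)
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)      # unit Frobenius norm

# Synthetic noise-free scene: eight points and a known rigid motion.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 8), rng.uniform(-1, 1, 8),
                     rng.uniform(4, 8, 8)])
n = np.array([1.0, 2.0, 3.0]) / np.sqrt(14.0)
K = skew(n)
R = np.eye(3) + np.sin(0.2) * K + (1.0 - np.cos(0.2)) * (K @ K)
T = np.array([0.5, -0.2, 0.1])
X2 = X @ R.T + T

x1 = X[:, :2] / X[:, 2:3]
x2 = X2[:, :2] / X2[:, 2:3]

E_est = estimate_e_matrix(x1, x2)
E_true = skew(T) @ R
E_true = E_true / np.linalg.norm(E_true)
singular_values = np.linalg.svd(E_est, compute_uv=False)
```

With exact data, `E_est` matches `E_true` up to sign and has two equal singular values and one zero singular value, the structure that the decomposition into R and T relies on. With noisy image coordinates the estimate degrades quickly, in line with the sensitivity noted above.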

Extensions of the above approaches were proposed by several
researchers [39]-[43]. One limitation of the approaches developed by
Tsai and Huang [36] and Longuet-Higgins [37] is the requirement of a
priori knowledge regarding nonzero translation. Zhuang and Haralick
[39]-[41] have developed an algorithm which overcomes this
limitation. Zhuang and Haralick do require that the observed object
points do not lie on a specific quadratic surface passing through
the origin. Faugeras, Lustman and Toscani [42] and Nagel [43]
reformulated the problem in more robust manners as
least-mean-squared error minimization problems.

The above approaches used 3-D points and their projections on the
image planes as observables in formulating the problem. An
alternative approach is to use 3-D lines and their projections as
observables. When lines are used as features, two views are no
longer sufficient and a minimum of three views is required. This is
due to the fact that 3-D lines possess an additional degree of
freedom when compared to 3-D points. In other words, one can slide a
3-D line along itself and obtain the same line. We present below an
overview of some techniques that use lines as features in the
estimation of structure and motion.

Yen and Huang [44], [45] have proposed an iterative method based on
spherical projection and on the observation of seven line
correspondences in three views for the case of general motion
between views. Liu and Huang [46], [47] have used line
correspondences in formulations analogous to the methods outlined
above. They decompose rigid body motion into first a rotation around
an axis through the origin and then a translation. For the case of
pure rotation, two line correspondences over two frames are
sufficient to determine the rotation matrix. The resulting nonlinear
equations are solved iteratively. For the case of pure translation,
five line correspondences over three frames produce a system of
linear equations which can be solved to determine the translation.
For the general case, Liu and Huang use six line correspondences in
three frames. The rotation matrix is first determined and then the
translation matrix is computed. Simulations of the iterative
algorithm on synthesized data show that the approach is highly
sensitive to noise and initial estimates. Moreover, estimation of
the translation vector is very sensitive to errors in estimation of
rotation. The algorithm has not been tested on real data.

A more robust formulation of motion estimation using line
correspondences, which incorporates the effect of noise, is due to
Faugeras, Lustman and Toscani [42]. An extended Kalman filtering
approach is followed in solving the nonlinear equations for a "best"
estimate of the motion parameters. The "best" estimate is defined to
be one that minimizes an expression that involves the measurables,
the unknowns, and partial derivatives of the nonlinear equation that
relates the unknowns to the measurables. The measurables for each
3-D line consist of three vectors, one for each of the three image
planes. Each vector corresponds to the unit normal of the plane
containing the projection of the 3-D line and the center of
projection for that image plane. The unknowns consist of the
rotation parameters that relate the positions of the three image
planes. After solving for the rotation, the translation is computed
via linear equations. The structure of the object can then be
computed via either a least-squares technique or the Kalman
filtering approach. Significant improvement was reported in
sensitivity to noise and initial estimates.

Implicit  in  the  above  discussion  was  the  assumption  that 
the  scene  contained  a  single  rigid  ob|ect.  Feature-based 
motion  analysis  has  also  been  applied  to  scenes  containing 
multiple  rigid  and  jointed  objects.  Webb  and  Aggarwal  [48] 
have  presented  a  method  for  recovering  the  3-D  structure 
of such scenes under orthographic projection. The fixed-axis
assumption is adopted to interpret images of moving
objects.  The  fixed-axis  assumption  asserts  that  every  rigid 
object  movement  consists  of  a  translation  plus  a  rotation 
about  an  axis  which  is  fixed  in  direction  for  a  short  period 
of  time.  It  is  shown  that,  under  the  fixed-axis  assumption, 
selecting  any  point  on  a  rigid  moving  object  as  the  origin 
of  a  coordinate  system  causes  the  other  points  to  trace  out 
circles in planes normal to the fixed axis within that coordinate
system. Under parallel projection, with the selected
point projecting to the image origin, these circles project
into  ellipses.  The  structure  of  the  rigid  object  can  be 
recovered  to  within  a  reflection  by  finding  the  equations 
describing  the  ellipses.  Furthermore,  it  is  shown  that  the 
lengths  of  the  long  and  short  axes  of  an  ellipse  are  functions 
of  the  position  of  the  point  in  space.  The  position  of  each 
point  in  space  (up  to  a  reflection  about  the  image  plane)  can 
then  be  recovered  provided  that  the  fixed  axis  of  rotation 
is  not  parallel  or  perpendicular  to  the  image  plane. 

A  jointed  object  is  an  object  made  up  of  a  number  of  rigid 
parts  which  cannot  bend  or  twist.  If  the  jointed  object  still 
moves  in  a  way  such  that  the  fixed-axis  assumption  holds 
for  each  rigid  part,  then  the  motion  and  structure  of  the 
jointed object can be recovered. It is assumed that the rigid
parts are connected by joints, which can be identified because they satisfy
two sets of motion constraints. If the joints are not visible,
they  can  be  found  by  solving  a  system  of  linear  equations. 
The  joints  can  then  be  used  to  eliminate  some  reflections 
and  thus  the  number  of  possible  interpretations  of  struc¬ 
ture  is  reduced.  Finally,  the  3-D  motion  of  each  object  is 
reconstructed. 

B.  Explicit  Use  of  Rigidity 

The  assumption  of  rigid  body  was  implicitly  used  in  the 
above formulations. We outline below a typical formulation
in which the constraint of rigid body motion is explicitly
invoked [49]. We discuss the case where five points in two
views  are  used  as  the  observables.  As  in  the  above  discus¬ 
sion,  the  relative  positions  of  the  cameras  are  unknown, 
and  the  correspondence  between  points  in  the  two  views 
is  assumed  known. 

The two central projection imaging systems are shown in Fig. 1. $C_1$ and $C_2$ are the centers of projection and $I_1$ and $I_2$ are the image planes. A point $P_i$ in space with coordinates $(X_i, Y_i, Z_i)$ in $S_1$ and $(U_i, V_i, W_i)$ in $S_2$ is imaged as $p_i$ on $I_1$ and $q_i$ on $I_2$. The objective of the analysis is to derive the structure of the points and the transformation between the coordinate systems, given the image coordinates of the observed points in the two imaging coordinate systems.

Fig. 1. Imaging geometry for the two views. $P_i$ is the 3-D point; $p_i$ and $q_i$ are the images of $P_i$ on the two image planes.

Because $P_i$ is on line $C_1 p_i$ (refer to Fig. 1), there exists a real number $\lambda_i > 1$ such that

$$X_i = \lambda_i x_i, \quad Y_i = \lambda_i y_i, \quad Z_i = (1 - \lambda_i) f_1,$$

where $(x_i, y_i)$ are the coordinates of $p_i$ in the $I_1$-image coordinate system, and $f_1$ is the distance from $C_1$ to the image plane. Similarly, $P_i$ is on line $C_2 q_i$ and if $(u_i, v_i)$ are the coordinates of $q_i$ in the $I_2$ coordinate system then there exists $\gamma_i > 1$ such that

$$U_i = \gamma_i u_i, \quad V_i = \gamma_i v_i, \quad W_i = (1 - \gamma_i) f_2.$$

The squared distance between points $P_i$ and $P_j$ expressed in $S_1$ is therefore

$$d_{ij}^2(S_1) = (X_i - X_j)^2 + (Y_i - Y_j)^2 + (Z_i - Z_j)^2$$

or

$$d_{ij}^2(S_1) = (\lambda_i x_i - \lambda_j x_j)^2 + (\lambda_i y_i - \lambda_j y_j)^2 + (\lambda_i - \lambda_j)^2 f_1^2.$$

Similarly, the squared distance between $P_i$ and $P_j$ expressed in $S_2$ is

$$d_{ij}^2(S_2) = (\gamma_i u_i - \gamma_j u_j)^2 + (\gamma_i v_i - \gamma_j v_j)^2 + (\gamma_i - \gamma_j)^2 f_2^2.$$

Now, the principle of conservation of distance allows us to write (assuming, of course, identical units of measurement in $S_1$ and $S_2$)

$$d_{ij}^2(S_1) = d_{ij}^2(S_2)$$

or

$$(\lambda_i x_i - \lambda_j x_j)^2 + (\lambda_i y_i - \lambda_j y_j)^2 + (\lambda_i - \lambda_j)^2 f_1^2 = (\gamma_i u_i - \gamma_j u_j)^2 + (\gamma_i v_i - \gamma_j v_j)^2 + (\gamma_i - \gamma_j)^2 f_2^2. \tag{3.1}$$

It may be seen that each point $P_i$ contributes two
unknowns, $\lambda_i$ and $\gamma_i$, and each pair of points $(P_i, P_j)$ gives
one second order equation (3.1). Therefore, 5 points yield
10 equations and 10 unknowns. Again, fixing the scale arbitrarily,
we end up with a system of 10 equations in 9
unknowns. Note that each equation involves only 4 of the
unknowns.  Since  distances  between  points  define  struc¬ 
ture  only  up  to  a  reflection  in  space,  the  solution  of  system 
(3.1)  based  on  these  distances  is  also  subject  to  this  uncer¬ 
tainty.  System  (3.1),  although  simple,  is  nevertheless  non¬ 
linear.  Experimental  results  using  existing  iterative  numer¬ 
ical  methods  do  indicate,  however,  that  the  solution  is  well 
behaved  [49]. 
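The structure of system (3.1) can be illustrated with a minimal sketch. It is not the solver of [49]; it only builds the ten pairwise residuals for five points and checks that they vanish at the true scale factors. The synthetic points, the motion (a rotation about the Z axis followed by a translation), and all function names are assumptions made for illustration. The projection model places the center of projection at $(0, 0, f)$ and the image plane at $Z = 0$, which reproduces $Z_i = (1 - \lambda_i) f_1$.

```python
import math
from itertools import combinations

def project(point, f):
    """Central projection: center at (0, 0, f), image plane Z = 0.

    Returns the scale factor s and image coordinates (x, y) such that
    X = s*x, Y = s*y, Z = (1 - s)*f, as in the text.
    """
    X, Y, Z = point
    scale = 1.0 - Z / f
    return scale, (X / scale, Y / scale)

def squared_distance(s_i, p_i, s_j, p_j, f):
    """One side of (3.1): squared 3-D distance written with image quantities."""
    return ((s_i * p_i[0] - s_j * p_j[0]) ** 2
            + (s_i * p_i[1] - s_j * p_j[1]) ** 2
            + (s_i - s_j) ** 2 * f ** 2)

def rigidity_residuals(lams, ps, gams, qs, f1, f2):
    """Left side minus right side of (3.1) for every pair of points."""
    return [squared_distance(lams[i], ps[i], lams[j], ps[j], f1)
            - squared_distance(gams[i], qs[i], gams[j], qs[j], f2)
            for i, j in combinations(range(len(lams)), 2)]

def move(point, angle, t):
    """Hypothetical rigid motion: rotation about Z, then translation t."""
    c, s = math.cos(angle), math.sin(angle)
    X, Y, Z = point
    return (c * X - s * Y + t[0], s * X + c * Y + t[1], Z + t[2])

# Synthetic data (assumed values): five points seen in two views.
points_s1 = [(0.0, 0.0, -5.0), (1.0, 0.5, -6.0), (-1.0, 2.0, -4.0),
             (0.5, -1.5, -7.0), (2.0, 1.0, -5.5)]
points_s2 = [move(p, 0.3, (0.2, -0.1, -0.5)) for p in points_s1]
f1 = f2 = 1.0

lams, ps = zip(*(project(p, f1) for p in points_s1))
gams, qs = zip(*(project(p, f2) for p in points_s2))
residuals = rigidity_residuals(lams, ps, gams, qs, f1, f2)
```

An iterative solver would treat the scale factors as unknowns and drive these residuals toward zero, with one scale factor fixed to remove the overall scale ambiguity.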

When the position of the points has been computed, determining the relative position of the cameras becomes a simple matter. Indeed, take 4 noncoplanar points (from the 5 observed points in space) and call $A_1$ and $A_2$ the matrices of homogeneous coordinates of these in $S_1$ and $S_2$, respectively. Then if $M$ is the transformation matrix (in homogeneous coordinate form) that takes $S_1$ onto $S_2$ we have

$$A_2 = A_1 M. \tag{3.2}$$

Since the 4 points are not coplanar, (3.2) can be solved for $M$. Now if we decompose motion $M$ into i) a rotation through angle $\Theta$ about an axis through the origin with direction cosines $n_1, n_2, n_3$, followed by ii) a translation $(t_1, t_2, t_3)$, and if it is written as

$$M = \begin{bmatrix} a_1 & a_2 & a_3 & 0 \\ a_4 & a_5 & a_6 & 0 \\ a_7 & a_8 & a_9 & 0 \\ t_1 & t_2 & t_3 & 1 \end{bmatrix}$$

then one can show that

$$\cos\Theta = (a_1 + a_5 + a_9 - 1)/2, \qquad \sin\Theta = (a_6 - a_8)/(2 n_1),$$

$$n_1 = \sqrt{(a_1 - \cos\Theta)/(1 - \cos\Theta)},$$

$$n_2 = (a_2 + a_4)/[2 n_1 (1 - \cos\Theta)],$$

$$n_3 = (a_3 + a_7)/[2 n_1 (1 - \cos\Theta)].$$

The algorithm has been shown to perform well on both real and synthetic data, and these results are presented in [49].
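The recovery of the rotation angle and axis from the entries of $M$ can be sketched as follows. The sketch assumes the row-vector convention implied by (3.2), that is, points are rows and are multiplied on the right by $M$, and it assumes $n_1 > 0$ and a nonzero rotation angle so that the divisions are defined. The helper `rotation_block` and the test values are hypothetical and serve only to build a matrix with a known answer.

```python
import math

def rotation_block(axis, theta):
    """Entries a1..a9 (row by row) of the rotation part of M for p' = p M."""
    n1, n2, n3 = axis
    c, s = math.cos(theta), math.sin(theta)
    k = 1.0 - c
    return [
        c + k * n1 * n1, k * n1 * n2 + s * n3, k * n1 * n3 - s * n2,
        k * n1 * n2 - s * n3, c + k * n2 * n2, k * n2 * n3 + s * n1,
        k * n1 * n3 + s * n2, k * n2 * n3 - s * n1, c + k * n3 * n3,
    ]

def axis_angle(a):
    """Angle and direction cosines from a1..a9.

    Assumes n1 > 0 and theta != 0; other cases need a different pivot entry.
    """
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = a
    cos_t = (a1 + a5 + a9 - 1.0) / 2.0
    n1 = math.sqrt((a1 - cos_t) / (1.0 - cos_t))
    n2 = (a2 + a4) / (2.0 * n1 * (1.0 - cos_t))
    n3 = (a3 + a7) / (2.0 * n1 * (1.0 - cos_t))
    sin_t = (a6 - a8) / (2.0 * n1)
    return math.atan2(sin_t, cos_t), (n1, n2, n3)

# Assumed test case: unit axis (1/3, 2/3, 2/3), angle 0.7 rad.
true_axis = (1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0)
theta, axis = axis_angle(rotation_block(true_axis, 0.7))
```

The translation components are read directly from the last row of $M$.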

The  use  of  lines  as  observables  in  an  approach  similar  to 
the  one  outlined  above  has  also  been  attempted  by  Mitiche, 
Seida  and  Aggarwal  [50]  who  used  the  principle  of  angular 
invariance  between  3-D  lines  on  a  rigid  body  undergoing 
motion.  In  their  method  the  orientation  of  lines  is  first 
recovered,  then  the  rotational  component  is  computed,  and 
finally,  the  translation  is  recovered.  The  observation  of  four 
lines  in  three  views  allows  for  the  determination  of  struc¬ 
ture  and  motion  parameters. 

The  use  of  line  correspondences  has  the  advantage  over 
the  use  of  point  correspondences  in  that  extraction  of  lines 
in  images  is  less  sensitive  to  noise  than  extraction  of  points. 
Also,  it  is  easier  to  match  line  segments  between  images 
than  it  is  to  match  points. 

It  is  possible  to  use  both  lines  and  points  concomitantly 
in  formulating  the  task.  In  the  case  of  combined  point  and 
line  correspondences,  four  points  and  a  line  in  two  views 
are  sufficient  to  compute  the  structure  of  the  scene  as  well 
as the displacement between views as described by Aggarwal and Wang [51].

The  following  observations  may  be  made  based  on  the 
current  literature: 

1)  Using  points  or  lines,  or  combination  of  points  and 
lines  for  the  computation  of  structure  and  motion 
usually  gives  rise  to  nonlinear  equations. 

2)  The  computation  based  upon  minimum  number  of 
points  or  lines  is  usually  more  sensitive  to  noise  per¬ 
turbations. 

3) In general, alternate formulations may give rise to different
sufficiency conditions regarding minimum
number  of  points  and  lines  required  for  solving  struc¬ 
ture  and  motion. 

C.  Using  Extended  Sequences  of  Monocular  Images 

The  approaches  outlined  above  attempt  to  recover  struc¬ 
ture  and  motion  from  a  limited  number  of  views  of  the 
scene,  typically  3  or  4  views.  We  discuss  below  some  tech¬ 
niques  that  use  long  sequences  of  monocular  images  to 
recover  structure  and  motion. 

The  first  of  these  is  the  incremental  approach  which 
allows  for  deviations  from  rigid  body  motion.  This  differs 
from  the  approaches  outlined  above  which  assumed  that 
the object being imaged undergoes rigid body motion. Psychophysical
studies have shown that the human visual system
can cope with less than strict rigidity [52], [26], [27]. These
studies  prompted  Ullman  to  devise  an  algorithm  that 
recovers  the  3-D  structure  of  viewed  objects  in  an  incre¬ 
mental  manner  using  several  views  of  an  object  in  motion 
[52]. The performance of the algorithm is argued to be comparable to that of the human visual system because it possesses the following characteristics [52]:

1) At each instant there exists an estimate of the 3-D
structure of the viewed object. The internal model
$M(t)$ of the viewed structure at time $t$ may be initially
crude and inaccurate, and may be influenced by static
sources of 3-D information.

2)  The  recovery  process  prefers  rigid  transformations. 

3)  It  is  able  to  integrate  information  from  an  extended 
viewing  period. 

4)  The  recovery  process  tolerates  deviations  from  ri¬ 
gidity. 

5) It eventually recovers the correct 3-D structure, or a
close approximation to it.

A parallel projection system is used. $M(t)$ consists of a set of 3-D coordinates $(X_i, Y_i, Z_i)$ where $(X_i, Z_i)$ are the observed image plane coordinates of a point and $Y_i$ is the depth. The estimation of structure therefore consists of finding $Y_i$. An initial set of values is chosen for the $Y_i$. Consider the situation at time $t'$. Let $(x_i, y_i, z_i)$ be the new structure of the corresponding points. The task is to find $y_i$ while minimizing deviations from rigidity. The deviation from rigidity is defined as follows. Let $L_{ij}$ denote the distance between points $i$ and $j$ at time $t$. Let $L'_{ij}$ denote the distance between points $i$ and $j$ at time $t'$. Under rigid motion $L'_{ij}$ should be equal to $L_{ij}$. The deviation in rigidity is expressed as

$$\varepsilon = \sum_{i,j} D_{ij}, \quad \text{where} \quad D_{ij} = \frac{(L_{ij} - L'_{ij})^2}{L_{ij}^3},$$

and the summation is for all $i, j$.
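The rigidity measure can be sketched in a few lines. The sketch assumes the normalization used by Ullman, in which each squared change in distance is divided by the cube of the model distance, and it sums over unordered pairs $i < j$ rather than over all ordered pairs, which changes the total only by a constant factor. The function names and test points are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Distances between all unordered pairs of 3-D points."""
    return {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def rigidity_deviation(model_points, new_points):
    """Sum of (L - L')^2 / L^3 over point pairs (assumed normalization)."""
    L = pairwise_distances(model_points)
    L_new = pairwise_distances(new_points)
    return sum((L[k] - L_new[k]) ** 2 / L[k] ** 3 for k in L)

# Assumed test points: a rigid shift leaves the measure at zero,
# a stretch along one axis makes it positive.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 2.0, y + 3.0, z + 4.0) for x, y, z in model]
stretched = [(2.0 * x, y, z) for x, y, z in model]

dev_rigid = rigidity_deviation(model, shifted)
dev_stretch = rigidity_deviation(model, stretched)
```

In the incremental scheme the unknown depths of the new structure are chosen to minimize this quantity, and the result becomes the model for the next view.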

Two modifications to the basic scheme were explored [52]. These included using different metrics for measuring the deviation from rigidity and allowing for a correction in the initial model $M(t)$. Simulations using synthetic data were conducted. Results indicate that the model does arrive at a good approximation to the 3-D structure after several views, but does not converge to the exact solution. Also, the solution is unique up to a mirror reflection. The modification involving a flexible model quickly arrived at a good approximation with a few views but with additional views the estimated structure oscillated about the correct solution. An analysis of the convergence properties of this algorithm has also been carried out by Hildreth and Grzywacz [53]. They have also suggested a continuous formulation of the above approach wherein instantaneous velocities of the points are used instead of point positions.

Although  it  is  argued  that  such  a  formulation  is  warranted 
when  arbitrarily  close  frames  are  used,  the  results  of  Hil¬ 
dreth  and  Grzywacz  indicate  that  local  velocity  information 
is  insufficient  to  solve  the  problem,  even  when  the  object 
is  viewed  over  an  extended  period.  The  major  limitation  of 
the  incremental  approach  discussed  above  is  that  it  per¬ 
forms  well  only  when  objects  rotate  about  a  fixed  axis.  In 
addition,  orthographic  projection  is  not  generally  valid.  The 
approach  does  however  illustrate  the  importance  of  motion 
in  the  perception  of  structure. 

Broida  and  Chellappa  [54]  consider  the  case  of  a  rigid  body 
undergoing  constant  translational  and  rotational  motion. 
This  assumption  allows  for  a  formulation  in  which  the  num¬ 
ber  of  unknown  model  parameters  does  not  increase  with 
the  increase  in  the  number  of  image  frames.  A  two-dimen¬ 
sional  object  undergoing  one-dimensional  motion  is 
assumed.  They  also  assume  that  the  object  structure  is 
known and attempt to recover the motion parameters. A Kalman filter is employed for recursive estimation of the motion parameters. The object is assumed to be transparent so that feature points are always visible and correspondence is assumed to have been established a priori. The unknown model parameters are represented as a vector:

$$[x_c \;\; \dot{x}_c \;\; z_c \;\; \dot{z}_c \;\; p_1 \;\; p_2 \;\; \omega]'$$

where $(x_c, z_c)$ is the location of the center of mass of the object, $(\dot{x}_c, \dot{z}_c)$ is the object translational motion, and $p_1$ and $p_2$ are unknown phase angles of moment arms $r_1$ and $r_2$ that connect the two feature points to the center of mass. Here $r_1$ and $r_2/r_1$ are assumed known. The differential equation describing unforced motion is written in terms of the above vector as:

$$\dot{x}(t) = [\dot{x}_c \;\; 0 \;\; \dot{z}_c \;\; 0 \;\; \omega \;\; \omega \;\; 0]'$$

with arbitrary initial conditions for $x_c(t)$, $z_c(t)$, $p_1(t)$, and $p_2(t)$. This system yields the following state equation:

$$x(k + 1) = F(k)\, x(k)$$

where $x(k) = [x_c(k) \;\; \dot{x}_c(k) \;\; z_c(k) \;\; \dot{z}_c(k) \;\; p_1(k) \;\; p_2(k) \;\; \omega(k)]'$ and

$$F(k) = \begin{bmatrix}
1 & \tau & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & \tau & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \tau \\
0 & 0 & 0 & 0 & 0 & 1 & \tau \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.$$

Here, $\tau$ is the time interval between successive images. The measurement model is given by

$$X_1 = L[x_c + r_1 \cos(p_1)]/[z_c + r_1 \sin(p_1)] = h_1[x(k)]$$

$$X_2 = L[x_c + r_2 \cos(p_2)]/[z_c + r_2 \sin(p_2)] = h_2[x(k)]$$

where $X_1$ and $X_2$ are the images of the two feature points and $L$ is the focal length of the sensor. The vector representation is given by

$$X(k) = [X_1(k) \;\; X_2(k)]' = h[x(k)] + n(k)$$

where $h[x] = [h_1(x) \;\; h_2(x)]'$ and $n(k)$ is the term corresponding to zero mean, Gaussian, spatially correlated, and temporally white noise.
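The state and measurement models above can be written out directly. The following sketch covers only the noise-free prediction and projection steps, not the Kalman filter of [54]; the numerical values of the state, $\tau$, $r_1$, $r_2$, and $L$ are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def transition_matrix(tau):
    """7x7 matrix F for the state [xc, xc_dot, zc, zc_dot, p1, p2, w]."""
    F = [[1.0 if r == c else 0.0 for c in range(7)] for r in range(7)]
    F[0][1] = tau  # xc advances by tau * xc_dot
    F[2][3] = tau  # zc advances by tau * zc_dot
    F[4][6] = tau  # p1 advances by tau * w
    F[5][6] = tau  # p2 advances by tau * w
    return F

def propagate(F, x):
    """State equation x(k+1) = F x(k)."""
    return [sum(F[r][c] * x[c] for c in range(7)) for r in range(7)]

def measure(x, r1, r2, L):
    """Noise-free image positions (X1, X2) of the two feature points."""
    xc, _, zc, _, p1, p2, _ = x
    X1 = L * (xc + r1 * math.cos(p1)) / (zc + r1 * math.sin(p1))
    X2 = L * (xc + r2 * math.cos(p2)) / (zc + r2 * math.sin(p2))
    return X1, X2

# Assumed example values.
state = [0.0, 1.0, 10.0, -0.5, 0.0, math.pi, 0.2]
next_state = propagate(transition_matrix(0.1), state)
image_points = measure(next_state, r1=1.0, r2=2.0, L=1.0)
```

Because the measurement functions are nonlinear in the state, a filter built on this model has to linearize them about the current estimate at each step.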

The above formulation is then used to design an iterated extended linear Kalman filter that solves for the state variables, in this case the translation and rotation parameters. The performance of the algorithms on Monte Carlo simulations is discussed in [54], while extensions of this approach are presented in [55].

Weng,  Huang  and  Ahuja  [56]  have  proposed  a  method  of 
characterizing  rigid  body  motion  from  long  monocular 
image  sequences,  i.e.,  over  extended  viewing  periods.  Their 
approach  involves  first  extracting  structure  and  motion 
parameters  with  two  views  of  8  points  [33]-[36]  and  then 
computing  the  trajectory  of  the  rotation  center  which  is  the 
center  of  mass  or  some  fixed  point  of  the  object.  They 
assume  that  the  angular  momentum  of  the  object  is  locally 
constant and the object possesses an axis of symmetry. They argue that if motion is smooth and the time interval covered by the model is relatively short, then the trajectory of the rotation center can be approximated by a polynomial. The developed model is applied to subsequences of images to estimate the trajectory and predict the new locations of object points. The main characteristic of interest is the existence of precessional motion and the parameters thereof. A least-squares method is adopted to compute the parameters. The authors present a detailed analysis of the relationship between the parameters of precessional motion and discrete two-view motion. The simulations discussed, however, deal only with 3-D point sets and no testing has been conducted using real data extracted from monocular image sequences.

D. The Correspondence Problem

In the above discussions it is repeatedly assumed that correspondence was available between features extracted from one image in a sequence of images and those extracted from the next image. The task of establishing and maintaining such correspondence is, however, nontrivial. The ambiguity is aggravated by the effects of occlusion which cause features to appear or disappear and also give rise to "false" features. The development of robust techniques to solve the correspondence problem is an active area of research that is still in its infancy. We present a brief description of a few of the approaches developed. The problem of finding correspondence is common to other areas of computer vision such as stereoscopy and optic flow. Some of the techniques developed for solving the correspondence problem in these other areas can be applied to the feature-based analysis of monocular images as well, and vice versa.

Aggarwal et al. [57] have classified correspondence processes
into two categories: those that are based on iconic
models and those that are based on structural models. The
former class consists of templates extracted from the first
frame which are then detected in the second and subsequent
frames. The second approach consists of extracting
tokens  with  a  number  of  attributes  from  the  first  image,  and 
using  domain  constraints  and  structural  models  to  match 
these  tokens  with  those  extracted  from  the  second  and  sub¬ 
sequent  images.  The  latter  approach  is  computationally 
more  expensive  but  also  more  robust  than  the  former. 

Sethi and Jain [58] describe a method for finding and maintaining correspondence between feature points extracted from a long sequence of monocular images. They present algorithms based on preserving the smoothness of velocity changes. The iterative optimization algorithms search for an optimum set of trajectories for feature points in a sequence of images based on constraints on the direction and magnitude of change in motion. A hypothesize and test approach is also proposed to handle occlusion. This method hypothesizes occlusion if the number of feature points detected in a frame is less than that detected in two or more preceding or succeeding frames. Interpolating the missing point position using the preceding two frames and testing this with the subsequent two frames verifies the existence of occlusion. Experiments
with  manually  extracted  features  illustrate  that  the  approach 
is  able  to  deal  with  limited  occlusion.  The  problem  of  auto¬ 
mated  extraction  of  features,  however,  has  not  been 
addressed  by  the  authors. 


Fang  and  Huang  [59]  have  presented  experimental  results 
of  motion  parameter  estimation  using  a  modified  version 
of an algorithm initially developed by Ranade and Rosenfeld
[60]. The relaxation algorithm is modified by incorporating
different scales to allow for large scale changes in
the  images  (due  to  large  translations  in  depth).  Another 
relaxation  technique  for  establishing  correspondence  is 
due to Kim and Aggarwal [61]-[63] who have applied their
technique  to  matching  features  in  stereo  imagery  as  well 
as  for  matching  3-D  features  in  depth  maps.  Barnard  and 
Thompson [64] have proposed an iterative relaxation labeling
technique for matching features in stereo imagery based
on smoothness in change of depth. This method may be
applied  to  matching  features  in  two  monocular  images 
based  on  smoothness  in  spatial  displacement  of  image  fea¬ 
tures.  Prager  and  Arbib  [65]  describe  a  technique  similar  to 
Barnard  and  Thompson  but  have  included  an  additional 
temporal  constraint  on  feature  displacements.  Many  other 
approaches  to  matching  image  features  can  be  found  in 
recent literature, for example see [66]-[68].

In this section we discussed the feature-based extraction of motion from monocular image sequences. It was assumed that image features, such as points and lines, had been extracted from each image and inter-frame correspondence had been established between them. Three approaches to the problem were discussed: the direct formulation method where rigid body motion is implicitly used, a formulation in which rigidity is explicitly invoked, and the third approach using long sequences of monocular images.

IV. Optic Flow Based Motion Estimation

In  this  section  we  present  approaches  in  which  the 
instantaneous  changes  in  brightness  values  in  the  image 
are analyzed to yield a dense velocity map called image flow
or optic flow. The three-dimensional motion and structure
parameters  are  then  computed  based  on  various  assump¬ 
tions  and/or  additional  information.  No  correspondence 
between  features  in  successive  images  is  required.  The  optic 
flow  techniques  rely  on  local  spatial  and  temporal  deriv¬ 
atives  of  image  brightness  values.  This  approach,  as  will  be 
evident from the following discussion, is distinct from the
feature-based  analysis  of  monocular  image  sequences  dis¬ 
cussed  in  the  previous  section  where  1)  a  relatively  sparse 
set  of  two-dimensional  features  is  extracted  from  the 
images,  2)  inter-frame  correspondence  is  established 
between  these  features,  3)  constraints  are  formulated  based 
on  assumptions  such  as  rigid  body  motion,  and  4)  the 
observed displacements of the 2-D image features are used
to  solve  these  equations  to  produce  3-D  structure  and 
motion  estimates. 

The  relative  motion  of  a  scene  with  respect  to  the  viewer 
gives  rise  to  a  distribution  of  velocities  on  the  image  plane. 
This  phenomenon  manifests  itself  as  temporal  change  in 
brightness  values  (gray  levels)  in  the  image  plane.  The  image 
velocities  are,  in  general,  functions  of  the  motion  of  viewed 
objects  relative  to  the  camera,  objects'  locations  in  3-D 
space,  and  3-D  structure  of  the  objects.  The  recovery  of  the 
3-D  motion  and  structure  information  from  the  sequence 
of  monocular  images  can  be  decomposed  into  two  steps: 
1)  compute  image  plane  velocities  from  changes  in  image 
intensity  values,  and  2)  use  optic  flow  to  compute  3-D 
motion and structure. We discuss below some basic formulations
of these two problems and outline the salient
features in solutions to these two tasks.

A.  Computing  Optic  Flow 

Let $g(x, y, t)$ be the image intensity at point $(x, y)$ in the image at time $t$. With the assumption that the intensity is the same at time $t + \Delta t$ for the point $(x + \Delta x, y + \Delta y)$ of the image, we have

$$g(x + \Delta x, y + \Delta y, t + \Delta t) = g(x, y, t) \tag{4.1}$$

where $\Delta t$, $\Delta x$, and $\Delta y$ are small. Approximating the left-hand side by a Taylor series

$$g(x + \Delta x, y + \Delta y, t + \Delta t) = g(x, y, t) + g_x \Delta x + g_y \Delta y + g_t \Delta t + \text{higher order terms}. \tag{4.2}$$

Ignoring the higher order terms in (4.2), using (4.1) in (4.2), dividing by $\Delta t$, and taking the limit as $\Delta t \to 0$

$$g_x u + g_y v + g_t = 0. \tag{4.3}$$

In this equation, the partial derivatives $g_x$, $g_y$ and $g_t$ are estimated from the image, and $u = dx/dt$ and $v = dy/dt$ are the velocity components in the directions $x$ and $y$, respectively, associated with the point $(x, y)$. The collection of such velocity vectors for the entire image constitutes the optic flow for the image.
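The constraint (4.3) can be checked numerically on a synthetic pattern. In the sketch below the brightness function, its velocity, and the sample point are all assumptions made for illustration; the derivatives are estimated by central differences rather than from real image data.

```python
import math

def brightness(x, y, t, u=0.3, v=-0.2):
    """Hypothetical pattern translating with image velocity (u, v)."""
    return math.sin(x - u * t) + math.cos(0.5 * (y - v * t))

def central_difference(fn, x, y, t, axis, h=1e-4):
    """Numerical partial derivative of fn along x (0), y (1) or t (2)."""
    hi, lo = [x, y, t], [x, y, t]
    hi[axis] += h
    lo[axis] -= h
    return (fn(*hi) - fn(*lo)) / (2.0 * h)

def constraint_residual(fn, x, y, t, u, v):
    """Left-hand side of (4.3) for a candidate velocity (u, v)."""
    gx = central_difference(fn, x, y, t, 0)
    gy = central_difference(fn, x, y, t, 1)
    gt = central_difference(fn, x, y, t, 2)
    return gx * u + gy * v + gt

# The true velocity satisfies the constraint; a wrong guess does not.
residual_true = constraint_residual(brightness, 0.4, 1.1, 2.0, 0.3, -0.2)
residual_wrong = constraint_residual(brightness, 0.4, 1.1, 2.0, 0.0, 0.0)
```

The residual is a single scalar per pixel, which is why one equation alone cannot fix both velocity components.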

Equation (4.3) embodies two unknowns $u$ and $v$, and is not sufficient by itself to specify the optical flow uniquely. It does, however, constrain the solution. It is possible to compute optical flow for images using the optical flow constraint equation together with additional assumptions. Popular assumptions include one of the following:

a) optical flow is smooth and neighboring points have
similar velocities,

b) optical flow is constant over an entire segment of the
image,

c) optical flow is the result of restricted motion, for
example, planar motion.

One such constraint is the smoothness constraint, i.e., motion field varies smoothly in most parts of the image [69]-[72]. Horn and Schunck [69] imposed this constraint by minimizing the error in optic flow expressed as:

$$\varepsilon^2(x, y) = (\text{error in (4.3)})^2 + \lambda^2 (\text{deviation from smoothness})$$

$$= (g_x u + g_y v + g_t)^2 + \lambda^2 \{(u_x^2 + u_y^2) + (v_x^2 + v_y^2)\} \tag{4.4}$$

where $\lambda$ is a constant. The task is to find $u$ and $v$ so as to minimize $R$ in the following

$$R = \iint \{(g_x u + g_y v + g_t)^2 + \lambda^2 [(u_x^2 + u_y^2) + (v_x^2 + v_y^2)]\} \, dx \, dy. \tag{4.5}$$

The integral equation may be solved by methods of calculus of variation. Differentiating (4.5) with respect to $u$ and $v$ and equating $\partial R/\partial u$ and $\partial R/\partial v$ to zero (for minimum error $R$), and approximating the Laplacians by $(u_{xx} + u_{yy}) \approx u_{ave} - u$ and $(v_{xx} + v_{yy}) \approx v_{ave} - v$, where $u_{ave}$ and $v_{ave}$ are local neighborhood averages, we get the following:

$$u = u_{ave} - g_x P/D, \qquad v = v_{ave} - g_y P/D \tag{4.6}$$

where

$$P = g_x u_{ave} + g_y v_{ave} + g_t, \quad \text{and} \quad D = \lambda^2 + g_x^2 + g_y^2.$$

Equation (4.6) may be solved iteratively, i.e., obtain $u(t)$, $v(t)$ using $u_{ave}(t - 1)$, $v_{ave}(t - 1)$, where $t$ here counts iterations.
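The iteration defined by (4.6) can be sketched on a small grid. The sketch takes the derivative fields as given, uses the plain mean of the four nearest neighbors with border replication as the local average (an assumption; Horn and Schunck use a weighted neighborhood), and starts from zero flow. The function names and the test fields are hypothetical.

```python
def neighbour_average(field):
    """Mean of the four nearest neighbours, replicating values at the border."""
    rows, cols = len(field), len(field[0])
    avg = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            up = field[max(r - 1, 0)][c]
            down = field[min(r + 1, rows - 1)][c]
            left = field[r][max(c - 1, 0)]
            right = field[r][min(c + 1, cols - 1)]
            avg[r][c] = (up + down + left + right) / 4.0
    return avg

def horn_schunck(gx, gy, gt, lam=1.0, n_iter=100):
    """Iterate (4.6) from zero flow; gx, gy, gt are 2-D lists of derivatives."""
    rows, cols = len(gx), len(gx[0])
    u = [[0.0] * cols for _ in range(rows)]
    v = [[0.0] * cols for _ in range(rows)]
    for _ in range(n_iter):
        u_ave, v_ave = neighbour_average(u), neighbour_average(v)
        for r in range(rows):
            for c in range(cols):
                P = gx[r][c] * u_ave[r][c] + gy[r][c] * v_ave[r][c] + gt[r][c]
                D = lam ** 2 + gx[r][c] ** 2 + gy[r][c] ** 2
                u[r][c] = u_ave[r][c] - gx[r][c] * P / D
                v[r][c] = v_ave[r][c] - gy[r][c] * P / D
    return u, v

# Assumed test case: a uniform brightness ramp with constant derivatives.
size = 5
gx = [[1.0] * size for _ in range(size)]
gy = [[0.5] * size for _ in range(size)]
gt = [[-1.0] * size for _ in range(size)]
u, v = horn_schunck(gx, gy, gt)
```

For a uniform ramp the iteration settles on the flow component along the brightness gradient, since the data carry no information about motion along the isobrightness direction.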

Horn  and  Schunck  show  that  the  iterative  method  con¬ 
verges  when  the  optic  flow  is  static  i.e.,  when  the  velocity 
vectors  do  not  change  with  time,  e.g.,  a  sphere  rotating 
about a stationary axis. When this condition is violated, e.g.,
when  an  object  translates  in  front  of  a  stationary  back¬ 
ground,  there  exist  boundaries  where  local  smoothness  of 
optic  flow  will  not  hold.  If  the  boundaries  can  be  detected 
then  the  technique  may  be  limited  to  smooth  regions.  Some 
techniques  for  determining  such  boundaries  are  discussed 
by  Schunck  [73]. 

The first-order approximation of (4.2) is unsatisfactory for edges and corners in the image [74]. First- and second-order derivatives of the Taylor series expansion of (4.2) were used by Snyder et al. [75] who obtained a single nonlinear equation in the two unknowns $u$ and $v$. Prazdny [76] used the approach suggested by Snyder et al. [75] to solve the problem where only pure translation of the sensor was involved. Prazdny further assumes that the Focus of Expansion (FOE)¹ of image flow is known and then solves for the magnitude of the image flow.

Yachida [77] extended Horn and Schunck's iterative method discussed above [69] for computing optic flow. The smoothness constraint considered not only a spatial neighborhood within the frame but also a temporal neighborhood, i.e., areas in the preceding and succeeding frames.

In  order  to  devise  additional  constraints  to  solve  the 
image  flow  equation  (4.3)  Nagel  [74],  [78]  has  posed  specific 
conditions  on  local  gray  value  distributions  and  has  pre¬ 
sented  an  operator  (gray  value  corner  detector)  that  detects 
locations  in  the  image  that  satisfy  these  conditions.  He 
develops  the  Taylor  series  of  (4.2)  up  to  second-order  terms. 
Minimizing an error functional results in a system of two
nonlinear  equations  in  u  and  v.  These  yield  a  closed  form 
solution  for  the  optic  flow  at  the  image  locations  detected 
by  the  corner  detector.  Nagel  and  Enkelmann  [79]  use  these 
values  as  initial  estimates  in  an  iterative  algorithm  that 
extends  the  solution  of  the  nonlinear  system  of  equations 
into  image  areas  surrounding  the  gray  value  corner.  Nagel 
[80]  has  also  proposed  a  modification  of  Horn  and  Schunck's 
smoothness  criterion  to  take  into  consideration  occluding 
edges. Nagel introduced a weight matrix which depends on
gray  level  changes  in  such  a  way  that  smoothness  require¬ 
ment  is  retained  only  for  the  optical  flow  component  which 
is  perpendicular  to  strong  gray  value  transitions. 

Haralick  and  Lee  [81]  use  (4.3)  in  conjunction  with  the 
requirement  that  the  first  derivatives  of  the  gray  value  struc¬ 
ture  that  has  been  displaced  in  the  image  due  to  object 
motion must remain the same. This yields three additional
equations:

$$g_{xx} u + g_{xy} v + g_{xt} = 0$$

$$g_{yx} u + g_{yy} v + g_{yt} = 0$$

$$g_{tx} u + g_{ty} v + g_{tt} = 0. \tag{4.7}$$

Equations (4.7) and (4.3) form an overdetermined system of four linear equations in $u$ and $v$. Tretiak and Pastor [82] also independently arrived at a similar formulation. The solution of the system of equations is effected by the pseudoinverse formalism [78], [82].

¹The Focus of Expansion (FOE) is defined as the intersection of the axis of camera translation with the image plane, when the intersection occurs on the positive half of the axis. When this intersection lies on the negative half of the axis of translation, it is termed the Focus of Contraction (FOC).
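The overdetermined system formed by (4.3) and (4.7) can be solved in the least-squares sense, which coincides with the pseudoinverse solution when the two coefficient columns are linearly independent. The sketch below solves the 2x2 normal equations directly; the derivative values and the true flow used to generate them are assumptions for illustration, and the function name is hypothetical.

```python
def least_squares_flow(rows):
    """Least-squares (u, v) for equations a*u + b*v + c = 0, one per row."""
    saa = sum(a * a for a, b, c in rows)
    sab = sum(a * b for a, b, c in rows)
    sbb = sum(b * b for a, b, c in rows)
    sac = sum(a * c for a, b, c in rows)
    sbc = sum(b * c for a, b, c in rows)
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("derivatives do not constrain both flow components")
    u = (-sac * sbb + sbc * sab) / det
    v = (-sbc * saa + sac * sab) / det
    return u, v

# Assumed derivative pairs: (g_x, g_y), (g_xx, g_xy), (g_yx, g_yy), (g_tx, g_ty).
true_u, true_v = 0.5, -0.25
coefficients = [(1.0, 2.0), (0.3, -0.1), (-0.1, 0.4), (0.2, 0.2)]
# Constant terms chosen so that every equation holds exactly at the true flow.
equations = [(a, b, -(a * true_u + b * true_v)) for a, b in coefficients]
u_est, v_est = least_squares_flow(equations)
```

With noisy derivative estimates the four equations are no longer consistent, and the same routine returns the flow that minimizes the sum of squared residuals.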

Hildreth [83] has developed a scheme for computing image velocity vectors along contours formed by detecting zero-crossings of the Laplacian of Gaussian (LOG) filtered image [67]. This approach is based on Marr's theory that initial motion measurements in the human visual system are made only at locations of significant intensity changes. The two-dimensional velocity field along the contour is described by the vector function $V(s)$, where $s$ denotes arc length. $V(s)$ can be decomposed into components $v^{\perp}(s)$ and $v^{T}(s)$ that are perpendicular and tangential, respectively, to the contour:

$$V(s) = v^{\perp}(s)\, u^{\perp}(s) + v^{T}(s)\, u^{T}(s), \tag{4.8}$$

where $u^{\perp}(s)$ and $u^{T}(s)$ are unit vectors in the directions perpendicular and tangential to the curve. An orthographic projection geometry is used. Solutions to (4.8) for the simple cases of constant velocity and rigid motion in image plane are discussed. The application of a more general constraint is then discussed, i.e., the assumption that velocity varies smoothly along the contour. To measure total variation in the velocity field the following continuous functional is proposed

$$\Theta(V) = \int \left| \frac{\partial V}{\partial s} \right|^2 ds. \tag{4.9}$$

This is combined with the constraint that the perpendicular component of the computed velocity field $V \cdot u^{\perp}$ must be close to the measured perpendicular component $v^{\perp}$ to form the following functional

$$\Theta(V) = \int \left| \frac{\partial V}{\partial s} \right|^2 ds + \beta \int [V \cdot u^{\perp} - v^{\perp}]^2 \, ds \tag{4.10}$$

where  .5  is  a  weighting  tac  tor.  A  disc  rete  torm  ot  the  above 
tunc  tonal  is  spec  iliuci 

<1-  -  1>-  -  4>.  (4.11) 


*!>.  -•  11  [tv,  -  v.  r  +  a;  -  V'.,  ,r] 

-  (V,  -  V,  )-  t  tV. .  -  V'.  )-’  (4.12) 

4’..  =  ,i  X  I  V  U';  *  V.  (/;  -  \ "]'  (4.13) 


where  k  is  the  number  of  points  in  the  contour.  In  order 
to  tind  the  velocities  ( V, ,  V, )  whic  h  minimize  4>,  4 4>/ctV,  and 
c)4>''dV,  are  equated  to  zero.  This  yields  2k  linear  equations 
whic  h  are  solved  via  the  conjugate  gradient  algorithm  [8.3]. 
Experimental  results  using  real  data  have  been  conducted 
where  the  initial  perpendicular  components  of  velocity 
were  computed  from  the  time  derivative  between  two  LOG 
filtered  images,  and  the  gradient  along  the  zero-crossing 
contours  of  the  first  filtered  image.  Experiments  on  syn¬ 
thetic  data  show  that  the  smoothness  criterion  does  :  ot 
guarantee accurateestimatesof  image  flow.  It  isargued  that 
the  velocity  field,  even  though  incorrect,  is  perceptually 
valid. 
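The minimization of the discrete functional (4.11)-(4.13) can be sketched as below. This is an illustrative reading, not Hildreth's implementation: it assumes a closed contour (so the first and last points are neighbours), uses dense matrices for brevity, and all function names are hypothetical. Setting the gradient of the functional to zero gives the 2k linear equations (S + beta N^T N) z = beta N^T v_perp, where S collects the squared neighbour differences and N projects each velocity onto its contour normal.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Solve A z = b for a symmetric positive-definite matrix A."""
    z = np.zeros_like(b)
    r = b - A @ z              # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        z = z + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

def contour_velocity(normals, v_perp, beta=1.0):
    """Velocity at each of k contour points from measured normal components.

    normals: (k, 2) unit normals to the contour; v_perp: (k,) measured components.
    """
    k = len(v_perp)
    # D takes differences between neighbouring points of a closed contour.
    D = np.eye(k) - np.roll(np.eye(k), 1, axis=1)
    L = D.T @ D
    S = np.zeros((2 * k, 2 * k))
    S[:k, :k] = L              # smoothness of the x components
    S[k:, k:] = L              # smoothness of the y components
    # N maps the stacked unknowns [Vx; Vy] to the normal components V . u_perp.
    N = np.hstack([np.diag(normals[:, 0]), np.diag(normals[:, 1])])
    A = S + beta * N.T @ N
    b = beta * N.T @ v_perp
    z = conjugate_gradient(A, b)
    return np.column_stack([z[:k], z[k:]])

# Usage: a circular contour translating with velocity (1.0, 0.5).
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
normals = np.column_stack([np.cos(theta), np.sin(theta)])
velocity = contour_velocity(normals, normals @ np.array([1.0, 0.5]))
```

For pure translation the true field has zero variation along the contour, so the smoothest field consistent with the normal components is the true one. For other motions the minimizer need not equal the true velocity field, which is the limitation noted in the text.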

Nagel [78] has presented a comparative analysis of the
above schemes of Horn and Schunck [69], Haralick and Lee
[81], Tretiak and Pastor [82], Nagel [80], and Hildreth [83]
using a mathematical formalism developed by him and has
shown the relationship between these approaches.

The above approaches deal with images at a single scale
of resolution, i.e., the finest resolution available from the
imaging sensor. Several hierarchical schemes have been
developed [84]-[87]. Enkelmann [84] creates a Gaussian low-pass
pyramid for each image. Processing begins at a coarse
level wherein the initial displacement vectors are set to zero.
These vectors are projected to finer levels via bi-linear
interpolation. Within each level, the velocity field is computed
via Nagel's approach [80] which embodies the oriented
smoothness criterion. A finite difference approach yields
a large sparse system of linear equations which is solved
using a multi-resolution relaxation approach. Glazer's
approach [85] uses Horn and Schunck's criteria [69]. Glazer
uses a Gaussian pyramid with quad-tree connectivity to
propagate velocity vectors from coarse to fine levels. Glazer
uses a finite difference approach and a complex multi-level
relaxation approach which involves dynamic switching
between levels. Anandan [87] uses a Laplacian pyramid
which provides a set of bandpass filters (as opposed to the
low-pass filters provided by Gaussian pyramids). A coarse
to fine control strategy is also employed via an "overlapped
projection scheme" that allows for multiple choices in the
propagation of velocity vectors. Anandan's technique is
based on establishing matches between image events in
successive frames. The match criterion is the minimization
of a Gaussian weighted sum-of-squared-differences (SSD)
in a 5 x 5 window, together with a confidence measure
based on the distribution of the SSD values. A smoothness
constraint similar to that of Glazer is used. The minimization
problem is solved via a finite-element method that
takes into consideration known discontinuities in the
displacement field.
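A Gaussian weighted SSD match over a 5 x 5 window can be sketched as follows. This is only an illustration of the match criterion at a single resolution level; it does not reproduce the pyramid, the confidence measure, or the smoothing stage of [87], and the window width `sigma`, the search radius, and the function names are assumptions.

```python
import numpy as np

def gaussian_window(size=5, sigma=1.0):
    """Normalized 2-D Gaussian weights for a size x size window."""
    r = np.arange(size) - size // 2
    w = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

def ssd_surface(frame1, frame2, row, col, search=2, size=5, sigma=1.0):
    """Weighted SSD between the window at (row, col) in frame1 and every
    window in frame2 displaced by up to `search` pixels."""
    h = size // 2
    w = gaussian_window(size, sigma)
    patch = frame1[row - h:row + h + 1, col - h:col + h + 1]
    ssd = np.empty((2 * search + 1, 2 * search + 1))
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            r0, c0 = row + dr, col + dc
            cand = frame2[r0 - h:r0 + h + 1, c0 - h:c0 + h + 1]
            ssd[dr + search, dc + search] = np.sum(w * (patch - cand) ** 2)
    return ssd

def best_displacement(ssd, search):
    """(row, column) displacement with the smallest weighted SSD."""
    i, j = np.unravel_index(np.argmin(ssd), ssd.shape)
    return int(i) - search, int(j) - search

# Usage: the second frame is the first shifted by (+1, -2) pixels.
rng = np.random.default_rng(0)
frame1 = rng.random((20, 20))
frame2 = np.roll(frame1, shift=(1, -2), axis=(0, 1))
match = best_displacement(ssd_surface(frame1, frame2, 10, 10), search=2)
```

The full SSD surface is returned, rather than only its minimum, because a confidence measure of the kind described above is derived from how the SSD values are distributed around the best match.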

Another method, called the multi-constraint method, is
emerging with promise. In this method one considers several
functions $f_1, f_2, \ldots, f_n$ such that each of them satisfies
the constraint equation. In particular,

$$ \frac{\partial f_i}{\partial x} u + \frac{\partial f_i}{\partial y} v + \frac{\partial f_i}{\partial t} = 0, \quad i = 1, \ldots, n. \tag{4.14} $$

Candidate functions include directional derivatives. However,
the results based upon these functions have not been
promising. Other candidate functions are obtained by applying
to the image an operator $O$ such as the contrast, entropy,
or average. Mitiche, Wang and Aggarwal [88] have reported
preliminary success in the computation of optical flow using
multi-constraint methods.

Fleet and Jepson [89] and Tsotsos et al. [90] have investigated
the extraction of motion information using Fourier
techniques. They proposed a hierarchical computational
framework for early processing in the human visual system
which involves the use of spatiotemporal linear filters tuned
to specific frequencies corresponding to specific image
velocities. A cascaded configuration of orientation specific
filters followed by speed specific filters was proposed.
Recently, Heeger [91] demonstrated that a family of motion-sensitive
Gabor filters can be used to compute optic flow.
He used 3-D (space-time) Gabor filters tuned to different
spatiotemporal-frequency bands and described a method
for combining the outputs of the filters to compute local
velocity vectors. He has further suggested a parallel
implementation and has illustrated the performance of his
approach with synthetic as well as real data.
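The link between spatiotemporal frequency and image velocity that these filter-based schemes rely on can be illustrated in one spatial dimension. A pattern translating with velocity u has its energy on the line f_t = -u f_x in the frequency domain, so a filter tuned to a particular (f_x, f_t) band responds to a particular velocity. The sketch below locates the spectral peak directly with an FFT; it is not the method of [89]-[91], and the function name and test signal are hypothetical.

```python
import numpy as np

def dominant_velocity(seq):
    """Velocity (pixels/frame) of the strongest translating component.

    seq: array of shape (T, N) holding T frames of a 1-D image with N pixels.
    """
    T, N = seq.shape
    spectrum = np.abs(np.fft.fft2(seq))
    ft = np.fft.fftfreq(T)          # temporal frequency, cycles per frame
    fx = np.fft.fftfreq(N)          # spatial frequency, cycles per pixel
    # Keep one half-plane only: the spectrum of a real signal is symmetric.
    spectrum[:, fx <= 0] = 0.0
    it, ix = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    # A pattern f(x - u t) has its energy on the line ft = -u * fx.
    return -ft[it] / fx[ix]

# Usage: a sinusoid with 3 cycles across the image moving 2 pixels per frame.
t = np.arange(64)[:, None]
x = np.arange(64)[None, :]
sequence = np.cos(2.0 * np.pi * 3 * (x - 2.0 * t) / 64)
u = dominant_velocity(sequence)
```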

The determination of optical flow for a scene consisting
of several moving objects has also been attempted.
Research has focused on segmenting the optic flow into
regions corresponding to distinct objects that undergo
different motion. Murray and Buxton [92] use a Bayesian
approach to formulate the segmentation problem. The optic
flow field is modeled as spatial and temporal Markov random
fields. The search for the globally optimal segmentation
is performed using simulated annealing. Thompson
[93] combines optical flow and contrast information in a
region growing scheme that segments images into regions
corresponding to surfaces moving with different velocities.
Thompson et al. [94] detect flow boundaries using an
algorithm patterned after the Marr-Hildreth zero-crossing
detector. O'Rourke proposed a method to group rotating
random dot patterns [95]. Fennema and Thompson extract
moving regions by collecting similar optical flow vectors
[96]. Adiv segmented an optic flow field using a grouping
method based on a Hough voting approach [97]. Webb and
Aggarwal [48] analyzed relative motion between
multi-jointed parts of objects. More recently, Tsukune and
Aggarwal [98] describe a method for extracting multiple rotational
flow fields in the Hough space for orthographically
projected 3-D velocity vector fields.

B. Computing Structure and 3-D Flow

Having computed optical flow, there still remains the
problem of computing the motion and the structure of the
object in three-dimensional space. A mathematical formulation
of the basic problem is first presented. The formulation
is that used by Prazdny [99], [100], Longuet-Higgins
and Prazdny [101], Waxman et al. [102], [103], and
Subbarao [104], [105], among others.

A camera centered Cartesian coordinate system $(X, Y, Z)$
is used. The $Z$ axis is directed along the viewing direction.
The image plane is normal to the $Z$ axis and is at unit distance
from the origin. The image coordinate system $(x, y)$
has its origin at $(0, 0, 1)$. The $x$ and $y$ axes are parallel to the
$X$ and $Y$ axes, respectively. In the perspective projection
geometry, the image of a point $(X, Y, Z)$ is formed by drawing
a line from it to $(0, 0, 0)$ which intersects the image plane
at $(x, y)$. Therefore

$$ x = X/Z \quad \text{and} \quad y = Y/Z. \tag{4.15} $$

The camera is assumed to be in motion, with $V = (V_X, V_Y, V_Z)$
being the translational velocity and $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$
being the rotational velocity. The instantaneous velocity of
a point $R = (X, Y, Z)$ is given by $(\dot{X}, \dot{Y}, \dot{Z}) = -(V + \Omega \times R)$
as follows:

$$ \dot{X} = -V_X - \Omega_Y Z + \Omega_Z Y $$
$$ \dot{Y} = -V_Y - \Omega_Z X + \Omega_X Z $$
$$ \dot{Z} = -V_Z - \Omega_X Y + \Omega_Y X. \tag{4.16} $$

From this the instantaneous image velocity $(u, v) = (\dot{x}, \dot{y})$ can
be written as

$$ u = \left( x \frac{V_Z}{Z} - \frac{V_X}{Z} \right) + \left[ x y \Omega_X - (1 + x^2) \Omega_Y + y \Omega_Z \right] \tag{4.17a} $$

$$ v = \left( y \frac{V_Z}{Z} - \frac{V_Y}{Z} \right) + \left[ (1 + y^2) \Omega_X - x y \Omega_Y - x \Omega_Z \right]. \tag{4.17b} $$

The estimation of structure and motion is based on the
key assumptions that i) the optic flow varies smoothly and
ii) the surface of the object is smooth. Assumption i) allows
the optic flow in a small image neighborhood around image
location $(x, y)$ to be specified by a Taylor series as:

$$ u(x, y) = u_0 + u_x x + u_y y + \tfrac{1}{2} u_{xx} x^2 + u_{xy} x y + \tfrac{1}{2} u_{yy} y^2 + O_3(x, y) \tag{4.18a} $$

$$ v(x, y) = v_0 + v_x x + v_y y + \tfrac{1}{2} v_{xx} x^2 + v_{xy} x y + \tfrac{1}{2} v_{yy} y^2 + O_3(x, y) \tag{4.18b} $$

where the partial derivatives can be computed from the
optic flow. Assumption ii) allows a small surface patch $Z(X, Y)$
around the line of sight to be described as:

$$ Z = Z_0 + Z_X X + Z_Y Y + \tfrac{1}{2} Z_{XX} X^2 + Z_{XY} X Y + \tfrac{1}{2} Z_{YY} Y^2 + O_3(X, Y) \tag{4.19} $$

where $Z_0 > 0$ is the distance of the surface patch along the line
of sight. Substituting the relation (4.15) for $Z$ in (4.19) in a
recursive manner it is possible to further approximate the
surface in terms of image plane coordinates as:

$$ Z(x, y) = \frac{Z_0}{1 - Z_x x - Z_y y - \tfrac{1}{2} Z_{xx} x^2 - Z_{xy} x y - \tfrac{1}{2} Z_{yy} y^2 - O_3(x, y)} \tag{4.20} $$

where $Z_x = Z_X$, $Z_y = Z_Y$, $Z_{xx} = Z_0 Z_{XX}$, $Z_{yy} = Z_0 Z_{YY}$, and $Z_{xy} = Z_0 Z_{XY}$.

Further, the scaled translational velocities are denoted as
follows:

$$ V_x = \frac{V_X}{Z_0}, \quad V_y = \frac{V_Y}{Z_0}, \quad V_z = \frac{V_Z}{Z_0}, \quad \text{for } Z_0 > 0. \tag{4.21} $$

From (4.17), (4.18), (4.20), and (4.21) it is possible to derive
the following relations [101]-[103], [105] assuming rigid uniform
motion:

$$ u_0 = -V_x - \Omega_Y, \qquad v_0 = -V_y + \Omega_X $$
$$ u_x = V_z + V_x Z_x, \qquad v_y = V_z + V_y Z_y $$
$$ u_y = \Omega_Z + V_x Z_y, \qquad v_x = -\Omega_Z + V_y Z_x $$
$$ u_{xx} = -2 V_z Z_x + V_x Z_{xx} - 2 \Omega_Y, \qquad u_{xy} = -V_z Z_y + V_x Z_{xy} + \Omega_X $$
$$ u_{yy} = V_x Z_{yy}, \qquad v_{xx} = V_y Z_{xx} $$
$$ v_{xy} = -V_z Z_x + V_y Z_{xy} - \Omega_Y, \qquad v_{yy} = -2 V_z Z_y + V_y Z_{yy} + 2 \Omega_X. \tag{4.22} $$

The system of equations (4.22) relates the optic flow $(u, v)$
and its first- and second-order spatial derivatives to the
3-D structure and motion parameters. The geometric structure
for the smooth surface is specified locally by the surface
slopes and curvatures, i.e., $Z_x$, $Z_y$, $Z_{xx}$, $Z_{xy}$, and $Z_{yy}$. The
three-dimensional motion parameters are the components
of $V$ and $\Omega$. The system (4.22) comprises twelve nonlinear
equations in eleven unknowns and is thus overdetermined.
The optic flow and its derivatives are available using any of
the methods outlined in the previous subsection. The
overdetermined system (4.22) may, hence, be solved to yield the
structure and motion parameters.
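The forward direction of (4.22), from structure and motion to flow parameters, can be checked numerically against the image velocity equations. The sketch below evaluates the image velocity (4.17) on the surface patch (4.20) and compares central-difference derivatives at the image origin with the twelve closed-form relations. The function names, the tuple `s = (Zx, Zy, Zxx, Zxy, Zyy)`, the use of `W` for the rotational velocity, and the numerical values are assumptions made for illustration.

```python
def image_velocity(x, y, Z, V, W):
    """Image velocity (4.17) at (x, y) for depth Z, translation V, rotation W."""
    VX, VY, VZ = V
    WX, WY, WZ = W
    u = (x * VZ - VX) / Z + x * y * WX - (1 + x * x) * WY + y * WZ
    v = (y * VZ - VY) / Z + (1 + y * y) * WX - x * y * WY - x * WZ
    return u, v

def patch_depth(x, y, Z0, s):
    """Depth of the surface patch (4.20) with the third-order term dropped."""
    Zx, Zy, Zxx, Zxy, Zyy = s
    return Z0 / (1 - Zx * x - Zy * y
                 - 0.5 * Zxx * x * x - Zxy * x * y - 0.5 * Zyy * y * y)

def flow_parameters(Vs, W, s):
    """The twelve relations (4.22); Vs is the scaled translation V / Z0 of (4.21)."""
    Vx, Vy, Vz = Vs
    WX, WY, WZ = W
    Zx, Zy, Zxx, Zxy, Zyy = s
    return {
        "u0": -Vx - WY,                      "v0": -Vy + WX,
        "ux": Vz + Vx * Zx,                  "vy": Vz + Vy * Zy,
        "uy": WZ + Vx * Zy,                  "vx": -WZ + Vy * Zx,
        "uxx": -2 * Vz * Zx + Vx * Zxx - 2 * WY,
        "uxy": -Vz * Zy + Vx * Zxy + WX,
        "uyy": Vx * Zyy,                     "vxx": Vy * Zxx,
        "vxy": -Vz * Zx + Vy * Zxy - WY,
        "vyy": -2 * Vz * Zy + Vy * Zyy + 2 * WX,
    }

def numerical_flow_parameters(Z0, V, W, s, h=1e-3):
    """Central-difference derivatives of the flow (4.17) at the image origin."""
    def flow(x, y):
        return image_velocity(x, y, patch_depth(x, y, Z0, s), V, W)
    out = {}
    for name, k in (("u", 0), ("v", 1)):
        def g(x, y, k=k):
            return flow(x, y)[k]
        out[name + "0"] = g(0.0, 0.0)
        out[name + "x"] = (g(h, 0) - g(-h, 0)) / (2 * h)
        out[name + "y"] = (g(0, h) - g(0, -h)) / (2 * h)
        out[name + "xx"] = (g(h, 0) - 2 * g(0, 0) + g(-h, 0)) / h ** 2
        out[name + "yy"] = (g(0, h) - 2 * g(0, 0) + g(0, -h)) / h ** 2
        out[name + "xy"] = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4 * h ** 2)
    return out

# Usage: compare the closed-form parameters with numerical derivatives.
Z0, V, W = 4.0, (0.4, -0.8, 1.2), (0.02, -0.01, 0.03)
s = (0.3, -0.2, 0.1, 0.05, -0.15)
closed_form = flow_parameters(tuple(c / Z0 for c in V), W, s)
numerical = numerical_flow_parameters(Z0, V, W, s)
```

Recovering structure and motion is the inverse problem: the twelve flow parameters are measured and the eleven unknowns are solved for. The forward check only confirms that the relations agree with (4.17) and (4.20).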

Many interesting observations may be made regarding
the above equations. Note from (4.21) and (4.22) that $Z_0$ is
not recoverable and only scaled translational velocity and
curvatures may be computed. Every nonlinear term in (4.22)
is a product of a structural parameter and a translational
velocity component. Every curvature parameter in (4.22) is
multiplied by a component of translational velocity ($V_x$ or
$V_y$) which is parallel to the image plane. Hence, if there is
no translation parallel to the image plane, surface curvatures
cannot be determined.
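The scale ambiguity noted above can be seen directly from the image velocity equations (4.17): translation enters only through the ratio V/Z, so doubling both the depth and the translational velocity leaves the flow unchanged. A minimal sketch, with a hypothetical function name and arbitrary numerical values:

```python
def image_velocity(x, y, Z, V, W):
    """Image velocity (4.17) at (x, y) for depth Z, translation V, rotation W."""
    VX, VY, VZ = V
    WX, WY, WZ = W
    u = (x * VZ - VX) / Z + x * y * WX - (1 + x * x) * WY + y * WZ
    v = (y * VZ - VY) / Z + (1 + y * y) * WX - x * y * WY - x * WZ
    return u, v

# Usage: a near, slow scene and a far, fast scene give the same image velocity.
rotation = (0.01, -0.02, 0.03)
near = image_velocity(0.2, -0.1, 5.0, (1.0, 2.0, 3.0), rotation)
far = image_velocity(0.2, -0.1, 10.0, (2.0, 4.0, 6.0), rotation)
```

The rotational terms do not involve Z at all, which is why rotation alone carries no information about depth.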

The nonlinear overdetermined system (4.22) may or may
not yield a unique solution. Many situations give rise to
dependent equations in (4.22) engendering multiple solutions.
A detailed analysis of numerous cases has been presented
by Subbarao [104], [105] and Waxman et al. [102], [103]
who have derived closed form solutions for these cases.
Subbarao shows that in general the solution is unique, and
at most four solutions are possible in certain situations.
Negahdaripour [106] also addressed the ambiguity in interpreting
optic flow produced by curved surfaces in motion.
He argues that the ambiguity is at most three-fold for the
case of certain hyperboloids of one sheet viewed by an
observer moving parallel to the image. The ambiguities
inherent in interpreting noisy flow fields are discussed by
Adiv [107].

An overview of some of the approaches for computing
structure and motion parameters from optic flow is given
below. The approaches typically involve restricting the
nature of motion to be purely translatory or rotational and/or
restricting the imaged surface to be planar. These
assumptions significantly reduce the complexity of the system
of equations (4.22).

Williams [108] considered the computation of the structure
of imaged scene components for the situation where
the sensor was involved in purely translatory motion. The
Focus of Expansion (FOE) of image flow is assumed known
and the scene is considered to consist of planar surfaces.
A height and position are hypothesized for each segmented
region. An image is generated for the known camera motion
and compared with the actual image. Error in the hypothesized
structure is computed from the difference between
these two images and appropriate corrections are made to
the hypothesized scene structure. This procedure is
repeated until the error falls below a threshold. This
approach has also been suggested for detecting the FOE.

An approach for determining scene structure from a
sequence of images acquired by a translating camera is
credited to Lawton [109]. In this method, features are
extracted from each image. Several directions of camera
motion are hypothesized. Each corresponds to a unique
FOE or FOC. Image feature displacements are computed for
each motion and compared with actual displacements. The
motion corresponding to minimum error in feature displacements
is chosen to be the best estimate. Scene structure
is computed in units of relative depth, i.e., ratio of depth
to change in depth. The technique allows for the segmentation
of objects at different depths.
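The hypothesize-and-test idea can be sketched for the pure translation case. From (4.17) with zero rotation, every flow vector lies along the line joining its image point to the FOE at (V_X/V_Z, V_Y/V_Z), so a candidate FOE can be scored by how far the observed vectors depart from the radial direction. This is a simplified stand-in rather than Lawton's procedure [109], which works with matched feature displacements; the error measure, the candidate grid, and the function names are assumptions.

```python
import numpy as np

def translational_flow(x, y, Z, V):
    """Image velocity (4.17) for pure translation V = (VX, VY, VZ)."""
    VX, VY, VZ = V
    return (x * VZ - VX) / Z, (y * VZ - VY) / Z

def foe_error(foe, x, y, u, v):
    """Sum of squared cross products between each flow vector and the
    direction from the candidate FOE to the image point."""
    xf, yf = foe
    return float(np.sum(((x - xf) * v - (y - yf) * u) ** 2))

def search_foe(x, y, u, v, candidates):
    """Return the candidate FOE with the smallest error."""
    errors = [foe_error(c, x, y, u, v) for c in candidates]
    return candidates[int(np.argmin(errors))]

# Usage: 50 points at random depths, camera translating with V = (0.2, -0.1, 1.0).
rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 50), rng.uniform(-1, 1, 50)
Z = rng.uniform(2, 10, 50)
u, v = translational_flow(x, y, Z, (0.2, -0.1, 1.0))
grid = [(a, b) for a in np.linspace(-0.5, 0.5, 11) for b in np.linspace(-0.5, 0.5, 11)]
foe = search_foe(x, y, u, v, grid)
```

The error measure is independent of the unknown depths, since depth only scales the length of each flow vector and not its direction.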

Rieger and Lawton [110] have devised a method for determining
the instantaneous axis of translation for a camera
undergoing general motion. Their method is based on the
observation of Longuet-Higgins and Prazdny [101] that two
surface points which lie on the same ray of projection but
at different depths will have image velocities that differ only
by the difference in the translational components of their
3-D velocity. Difference vectors are computed at optic flow
discontinuities and the intersection of these difference vectors
is estimated via an optimization technique similar to
that used in [109]. The translational axis is specified by this
procedure and the computation of camera rotation and
translation is simplified.
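The observation underlying this method follows from (4.17): at a fixed image point the rotational terms do not depend on depth, so the difference between the image velocities of a near and a far surface point is purely translational and points along the line through the FOE. The sketch below forms such difference vectors and intersects the resulting lines by linear least squares. The least-squares step is an assumption standing in for the optimization used in [110], and the function names are hypothetical.

```python
import numpy as np

def image_velocity(x, y, Z, V, W):
    """Image velocity (4.17) at (x, y) for depth Z, translation V, rotation W."""
    VX, VY, VZ = V
    WX, WY, WZ = W
    u = (x * VZ - VX) / Z + x * y * WX - (1 + x * x) * WY + y * WZ
    v = (y * VZ - VY) / Z + (1 + y * y) * WX - x * y * WY - x * WZ
    return u, v

def flow_difference(x, y, Z_near, Z_far, V, W):
    """Difference of image velocities for two depths on the same ray."""
    u1, v1 = image_velocity(x, y, Z_near, V, W)
    u2, v2 = image_velocity(x, y, Z_far, V, W)
    return u1 - u2, v1 - v2

def intersect_difference_vectors(points, diffs):
    """Least-squares intersection of the lines through each point along its
    difference vector: dv * xf - du * yf = dv * x - du * y."""
    A = np.array([[dv, -du] for du, dv in diffs])
    b = np.array([dv * x - du * y for (x, y), (du, dv) in zip(points, diffs)])
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

# Usage: general motion (translation and rotation), three depth discontinuities.
V, W = (0.3, -0.2, 1.0), (0.05, -0.02, 0.1)
points = [(0.5, 0.4), (-0.3, 0.2), (0.1, -0.6)]
diffs = [flow_difference(x, y, 2.0, 8.0, V, W) for x, y in points]
foe = intersect_difference_vectors(points, diffs)   # close to (0.3, -0.2)
```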

Prazdny [111] proposed an approach in which the velocity
field is decomposed into rotational and translational components.
The rotational motion is hypothesized and the FOE
is identified for the resultant translational field. An error
function of three parameters is used to evaluate the estimated
motion. Minimization of the error yields the best
estimate. The algorithm has been tested on data generated
by simulated planar surfaces in motion.

Bruss and Horn [112] and Horn [71] discuss the formulation
of an iterative least-mean-squared error approach to
the estimation of 3-D motion from optical flow. They make
no a priori assumptions about the motion. They derive a
system of seven equations, three of which are linear in $V_X$,
$V_Y$, and $V_Z$, and four which are solved via a numerical
method. No experimental results, however, have been
shown. Horn and Weldon [113] have proposed methods for
computing purely translational or purely rotational 3-D
motion directly from brightness gradients without computing
optical flow. They employ only first derivatives of the
image gray levels, and analyticity of the surface is not
required. Negahdaripour and Horn [114] discuss the recovery
of motion of a camera relative to a planar surface. They
also do not compute optic flow, and use instead the spatial
and temporal derivatives of brightness values directly. They
present iterative schemes for solving nine non-linear equations
based on a least-squares formulation, and also present
a closed form solution.

Chou and Kanatani [115] use a scheme in which object
motion is initially hypothesized and iteratively refined. They
extract features from the images obtained before and after
motion. They do not require that feature correspondence
be established a priori. They transform the first set of features
and evaluate the discrepancy between the estimated
feature positions and the true feature positions (in the second
image) after motion. Assuming infinitesimal motion,
they relate the discrepancy to optic flow parameters. They
use a numerical least-squares technique to solve the linear
constraints for a better estimate of the motion. This process
is repeated until the estimated motion produces feature
positions that are sufficiently close to the true ones obtained
in the second image after motion.

In this section we have presented the optic flow approach
for the estimation of motion parameters from a sequence
of monocular images. We discussed the basic formulation
of the problem and outlined some of the recently developed
techniques for computing the optic flow. The above
discussion included the problem of inferring 3-D structure
and motion from optic flow and overviews of some of the
solutions to this problem.

V. Comparing Optic Flow and Feature-Based Methods

In the preceding sections we discussed two distinct
approaches for the estimation of motion from monocular
image sequences, i.e., feature-based analysis and optical
flow methods. In this section we compare the two
approaches and discuss some of the advantages and
disadvantages associated with each of the methods.

Feature-based approaches require that correspondence
be established between a sparse set of features extracted
from one image with those extracted from the next image
in the sequence. Although several methods have been discussed
for extracting and establishing feature correspondence,
the task is difficult and only partial solutions suitable
for simplistic situations have been developed. In general,
the process is complicated by occlusion which may cause
features to be hidden, false features to be generated and
hidden features to reappear. Much more work needs to be
done in this area before the advent of one or more general
techniques that can be reliably applied to real imagery. In
comparison, the optic flow approach, in general, does not
require any feature correspondence to be established.

The computation of the optical flow as well as the interpretation
of motion and structure from optic flow requires
the evaluation of first and second partial derivatives of
image brightness values and also of the optic flow. Real
images are, in general, noisy. The evaluation of derivatives
is a noise enhancing operation. The higher the order, the
more noise sensitive is the derivative. Hence, even in cases
where closed form solutions for the 3-D structure and
motion exist, the optical flow techniques do not produce
usable results because of the sensitivity to noise [71]. Also,
there are discontinuities in the optical flow depending upon
occlusion, and these regions must be detected reliably;
otherwise violations of the continuity assumption will have
adverse and global effects on the estimate of optical flow.

In contrast to the method of global minimization, another
approach depends upon solving a set of constraints in a
small neighborhood. However, the local and global methods
rely on similar assumptions of smoothness of the optical
flow field. The common weakness of both methods is the
inaccurate estimates at points where the flow changes
sharply or is discontinuous. The global method propagates
the errors across the entire image, while the neighborhood
size limits the propagation in local methods. Schunck [70]
and Kearney et al. [116], [117] address these difficulties in
detail. Kearney et al. present a detailed analysis of the
sources of errors in local optimization techniques for computing
optical flow [116]. They identify three main sources
of error:

1) Poor estimation of brightness gradients in highly textured
image regions. The problem is especially severe
for temporal gradients in moving regions.

2) Variations in optic flow across the image violate the
assumption of locally constant flow. Significant error
arises at discontinuities in the flow field.

3) Insufficient local variation in the orientation of the
brightness gradient which causes error propagation
in the ill-conditioned system.
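The third source of error can be made concrete with a generic local least-squares estimator, in which one constraint g_x u + g_y v + g_t = 0 per pixel is stacked over a neighbourhood under the assumption of locally constant flow. This is a sketch of the general idea and not the specific estimator analyzed in [116]; the function name, the form of the constraint, and the gradient values are assumptions.

```python
import numpy as np

def local_flow(gx, gy, gt):
    """Constant-flow least squares over a neighbourhood.

    Returns the flow estimate and the condition number of the normal matrix.
    """
    A = np.column_stack([gx, gy])
    cond = np.linalg.cond(A.T @ A)
    # The pseudoinverse gives the least-squares (minimum-norm) solution.
    flow = np.linalg.pinv(A, rcond=1e-10) @ (-np.asarray(gt, dtype=float))
    return flow, cond

true_flow = np.array([1.0, -2.0])

# Gradient orientation varies across the neighbourhood: well conditioned.
gx = np.array([1.0, 0.2, -0.7, 0.5])
gy = np.array([0.1, 1.0, 0.6, -0.9])
good_flow, good_cond = local_flow(gx, gy, -(gx * true_flow[0] + gy * true_flow[1]))

# All gradients parallel to (1, 2): only the flow component along the
# gradient can be recovered, and the normal matrix is singular.
gx2 = np.array([1.0, 2.0, 3.0, 4.0])
gy2 = 2.0 * gx2
bad_flow, bad_cond = local_flow(gx2, gy2, -(gx2 * true_flow[0] + gy2 * true_flow[1]))
```

When the condition number is large, small errors in the measured gradients are amplified in the estimate, which is the error propagation referred to in item 3.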


Sensitivity to noise is also a problem with the feature-based
techniques though to a lesser degree. The techniques
reported in the literature have all been only marginally
tolerant to noise. One method of decreasing the sensitivity
to noise has been to use more than the required
minimum number of features in an iterative least-squares
technique. Although this usually has a smoothing effect, it
can cause additional complications. For example, if all the
additional points chosen are coplanar, then all that has been
achieved is a significant increase in the computation time
and probable instability of the solution. The establishment
of correspondence also becomes computationally expensive.

Recently, Verri and Poggio [118] argued that the optic flow
does not correspond to the 2-D velocity field unless very
special conditions are satisfied. They argue against the use
of optic flow for quantitative estimates of 3-D motion. They
apply the theory of stability of dynamical systems to the
optic flow formulation and conclude that the optic flow may
provide stable qualitative information such as the Focus of
Expansion and motion discontinuities.

When numerical techniques are used for the solution of
structure and motion using either approach one must consider
the many caveats involved in such a solution. A discussion
of these caveats would be inappropriate in this
paper and the reader is directed to the literature in numerical
analysis for possible pitfalls and remedies.

Much attention has been devoted recently by the computer
vision community to the use of regularization techniques
in many vision tasks including both feature-based
formulations and the optic flow approach for motion and
structure estimation [119]-[123]. This technique is used to
reformulate certain ill-posed problems into well-posed
problems. The ill-posed problems are those for which either
1) the solution exists but is not unique, or 2) the solution
does not depend continuously on the input data. Regularization
is typically formulated as an error minimization and
involves a stabilizing functional that is applied to the input
data and perhaps an additional smoothing parameter. Due
to the seemingly infinite choice of possible stabilizing functions
and smoothness parameters it is difficult to specify a
best regularizing algorithm for an application.
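A small example may help fix the idea. A single brightness constraint g_x u + g_y v = -g_t has infinitely many solutions (u, v), which is the first kind of ill-posedness listed above. Adding a stabilizing term lam * |x|^2 to the squared residual makes the minimizer unique. The quadratic stabilizer and the value of the parameter `lam` are assumptions chosen for simplicity; they are one of the many possible choices mentioned in the text, not a recommended one.

```python
import numpy as np

def regularized_solution(A, b, lam):
    """Minimize |A x - b|^2 + lam * |x|^2 (a quadratic stabilizer)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    n = A.shape[1]
    # Setting the gradient to zero gives (A^T A + lam I) x = A^T b,
    # which has a unique solution for every lam > 0.
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Usage: one constraint 3u + 4v = 10 in two unknowns.
x = regularized_solution([[3.0, 4.0]], [10.0], lam=1e-3)
```

For small `lam` the result approaches (1.2, 1.6), the solution of smallest norm, which lies along the gradient direction (3, 4). Larger values of `lam` trade fidelity to the data for a smaller solution, which illustrates why the choice of stabilizer and parameter is application dependent.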

VI. Computing Motion from a Sequence of Stereo
Images

The techniques described in the previous sections determine
the motion and structure of an object given a sequence
of monocular images of the scene. It was seen that in both
the feature-based methods as well as in the optic flow techniques,
the solutions for structure and motion remain
ambiguous with respect to the absolute value of distance
between the camera and the scene. In other words, structure
and motion parameters are unique only up to a scaling
factor. The use of stereoscopy can provide this additional
parameter to uniquely determine depth and hence absolute
values for the structure and motion parameters.

The fusion of stereo and motion may be effected with
different objectives in mind. Stereoscopic processing may
be used to aid motion recovery, or conversely, motion analysis
may be used to help establish feature correspondence
in stereo image pairs. The fusion of these two processing
modules in human and other biological visual systems has
been detected via neurobiological and psychophysiological
investigation [124], [125]. Recent research in both the
feature-based and optic flow based approaches has
addressed the fusion of stereoscopic analysis and motion
estimation. We outline the salient features of such effort.

A. Feature Based Analysis

The overall analysis consists of the following steps: i) from
the sequences of stereo images, the depth map for each
stereo pair is determined, ii) the correspondence between
three-dimensional features in successive depth maps is
established, and iii) the motion of the objects is computed
based upon the matched features. This formulation of
motion analysis based on sequences of stereo images has
several advantages and disadvantages which are briefly
discussed below.

Kim and Aggarwal discuss the estimation of motion
parameters from a sequence of depth maps extracted from
stereo images [63]. The depth map for each stereo pair is
computed using an edge-based stereo algorithm. 3-D features
(consisting of lines and points) are extracted from each
depth map. These features are matched between successive
depth maps using a two pass relaxation process [61],
[62]. In the process of extraction, search and matching, the
search space is limited to the area of the motion in the image
by an image differencing technique.

In general, correspondences between two 3-D lines
extracted from one depth map and those from another may
be used to determine the motion of a rigid object, assuming
that the motion is small. Here, a three-dimensional line is
specified by a three-dimensional direction and a point on
the line. The same method can be used for three-dimensional
point correspondences since two points determine
a line. In general, three point correspondences, or one line
correspondence and one point correspondence are sufficient
to determine the three-dimensional motion parameters
of a moving object. In the former case, the three points
should not be collinear, and in the latter case, the point
should not lie on the line. A system of linear equations
is derived and the solution is straightforward. A system
based upon these observations has been implemented to
derive the structure and the displacement of the objects
between the views. In this study the motion of simple toy
objects was estimated with excellent results [63].
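For point correspondences, a linear system of the kind mentioned above can be sketched under the small-motion assumption. A point p is taken to move to q = p + w x p + t, where w is a small rotation vector and t the translation, which is linear in the six unknowns. This is an illustrative linearization and not necessarily the formulation of [63]; the function names and numerical values are hypothetical.

```python
import numpy as np

def skew(p):
    """Matrix form of the cross product: skew(p) @ w equals p x w."""
    x, y, z = p
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def small_motion_from_points(P, Q):
    """Estimate (w, t) with q = p + w x p + t from matched 3-D points.

    P, Q: arrays of shape (n, 3). Returns w, t, and the rank of the system.
    """
    rows, rhs = [], []
    for p, q in zip(P, Q):
        # w x p = -(p x w) = -skew(p) @ w, so each point gives three equations.
        rows.append(np.hstack([-skew(p), np.eye(3)]))
        rhs.append(q - p)
    A = np.vstack(rows)
    d = np.concatenate(rhs)
    sol, *_ = np.linalg.lstsq(A, d, rcond=None)
    return sol[:3], sol[3:], int(np.linalg.matrix_rank(A))

# Usage: three non-collinear points displaced by a known small motion.
P = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, 6.0], [-1.0, -1.0, 4.0]])
w_true = np.array([0.01, -0.02, 0.015])
t_true = np.array([0.1, -0.05, 0.2])
Q = P + np.cross(w_true, P) + t_true
w, t, rank = small_motion_from_points(P, Q)
```

Three non-collinear points give nine equations of rank six, so the six unknowns are determined. If the three points are collinear the rank drops to five, because a rotation about the line through them cannot be observed, which matches the non-collinearity condition stated in the text.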

Although  it  is  theoretically  quite  easy  to  estimate  the 
motion  parameters  given  the  correspondence  between  two 
sets  of  3-D  points,  practical  considerations  complicate  the 
implementation  of  the  system.  In  stereo  imagery,  the  range 
values  estimated  are  subject  to  a  great  deal  of  uncertainty 
due  primarily  to  quantization  of  disparity.  More  robust  for¬ 
mulations  of  the  problem  of  motion  estimation  using 
sequences  of  stereo  images  have  been  proposed  [126]-[128]. 
One  approach  has  been  to  estimate  motion  parameters  via 
a  system  of  linear  equations  using  3  points  in  each  depth 
map  [126].  Several  sets  of  3  points  are  chosen  from  the  large 
number  of  available  points  and  the  motion  parameters  are 
computed  for  each  set.  For  each  set  of  computed  motion 
parameters,  all  available  points  in  the  first  depth  map  are 
subjected  to  the  estimated  motion.  The  discrepancy 
between  the  points  in  the  second  depth  map  and  trans¬ 
formed  points  from  the  first  depth  map  is  computed  via  a 
simple  distance  measure.  The  set  of  estimated  motion
parameters  that  yields  the  lowest  error  is  chosen.  Although
the  solution  of  the  system  of  linear  equations  is  easy,  the 
estimation  of  large  sets  of  motion  parameters  and  espe¬ 
cially  the  search  for  the  best  set  of  motion  parameters  is 
computationally  intensive. 
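
The hypothesize-and-test procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [126]: the triplets are drawn at random, the per-triplet motion is obtained with an SVD-based least-squares fit (a swapped-in solver, used here instead of the linear system of [126]), and the "simple distance measure" is taken to be the mean Euclidean distance. All names are hypothetical.

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares R, t with Q ~ R @ P + t (SVD-based fit; swapped-in solver)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), also when H is rank deficient.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def best_triplet_motion(P, Q, n_trials=50, seed=0):
    """Try several 3-point subsets; keep the motion with the lowest overall error."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_trials):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = rigid_fit(P[idx], Q[idx])
        # Apply the candidate motion to ALL points of the first depth map.
        err = np.mean(np.linalg.norm(P @ R.T + t - Q, axis=1))
        if err < best[0]:
            best = (err, R, t)
    return best
```

The cost noted in the text is visible in the loop: every candidate motion requires a pass over all available points, so the work grows with the product of the number of triplets tried and the number of points.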

An  alternative  approach  has  been  to  use  a  least-mean- 
squared  error  analysis  [127],  [129].  The  underlying  principle 
here  is  again  the  invariance  of  distance  between  points  on 
an  object  subjected  to  rigid  motion.  The  formulation  is  anal¬ 
ogous  to  the  approach  followed  by  Magee  and  Aggarwal 
[130],  [131]  for  determining  motion  parameters  from 
sequences  of  range  images.  While  the  direct  method  of 
solution  is  adopted  in  [130],  [131],  a  two-part  iterative 
approach  is  adopted  in  [127].  The  displacement  between
the  centroids  of  two  sets  of  registered  3-D  points  is  used 
to  determine  the  translation  vector.  The  rotation  matrix  is 
decomposed  into  three  factors  corresponding  to  rotations 
about  the  z,  x,  and  y  axes.  Each  of  these  is  individually  solved 
for  while  the  other  two  are  fixed.  This  is  repeated  in  a  cyclic 
manner  until  a  least  mean  squared  error  criterion  is  satisfied.
The  advantage  of  the  above  decomposition  is  that
the  3-D  estimation  problem  reduces  to  a  set  of  2-D  problems 
which  are  more  tractable. 
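
A minimal sketch of such a two-part scheme is given below. It is written under assumptions that are not taken from [127]: the factor order R = Ry Rx Rz, the motion model Q = R P + t, the stopping rule, and all names are hypothetical. The centroids remove the translation, and each angle is then updated in closed form while the other two are held fixed, since a rotation about a single coordinate axis is a 2-D alignment problem.

```python
import numpy as np

def axis_rotation(k, theta):
    """Rotation by theta about coordinate axis k (0 = x, 1 = y, 2 = z)."""
    i, j = (k + 1) % 3, (k + 2) % 3
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(3)
    R[i, i], R[i, j], R[j, i], R[j, j] = c, -s, s, c
    return R

def best_axis_angle(k, U, V):
    """Angle about axis k that best rotates the rows of U onto the rows of V."""
    i, j = (k + 1) % 3, (k + 2) % 3
    num = np.sum(U[:, i] * V[:, j] - U[:, j] * V[:, i])
    den = np.sum(U[:, i] * V[:, i] + U[:, j] * V[:, j])
    return np.arctan2(num, den)

def cyclic_rigid_fit(P, Q, axes=(1, 0, 2), sweeps=200, tol=1e-16):
    """Fit Q ~ R @ P + t with R = Ry @ Rx @ Rz, one angle at a time."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    A, B = P - cp, Q - cq              # centring removes the translation
    angles = np.zeros(3)
    prev = np.inf
    for _ in range(sweeps):
        for m in range(3):
            F = [axis_rotation(k, a) for k, a in zip(axes, angles)]
            left, right = np.eye(3), np.eye(3)
            for G in F[:m]:
                left = left @ G
            for G in F[m + 1:]:
                right = right @ G
            # Minimise |left @ Rm @ right @ a - b| over the single angle of Rm.
            angles[m] = best_axis_angle(axes[m], A @ right.T, B @ left)
        F = [axis_rotation(k, a) for k, a in zip(axes, angles)]
        R = F[0] @ F[1] @ F[2]
        err = np.mean(np.sum((A @ R.T - B) ** 2, axis=1))
        if prev - err < tol:           # mean squared error has stopped improving
            break
        prev = err
    return R, cq - R @ cp
```

Each inner update cannot increase the squared error, so the cycle decreases the criterion monotonically. Like any coordinate-wise scheme, it is a local method and may need a reasonable starting estimate for large rotations.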

The  above  approaches  consider  the  determination  of 
structure  and  motion  as  separate  issues.  Hence,  if  structure
is  first  computed  (as  is  usually  the  case  for  stereo  imagery)
then  errors  accrued  due  to  quantization  of  disparity  will
continue  to  plague  the  estimation  of  motion.  To  alleviate 
this  problem  a  new  approach  has  been  developed  by  Kiang, 
Chou  and  Aggarwal  [132]  based  on  iterative  refinement  of 
both  structure  and  motion  estimates.  The  approach  is  based 
on  a  1-D  model  for  triangulation  error  in  stereoscopy.  The 
strategy  for  modifying  structure  and  motion  estimates  is 
based  on  the  structural  relationship  between  the  corre¬ 
sponding  uncertainty  polyhedra  in  successive  depth  maps. 
Experimental  results  using  synthetic  as  well  as  real  data 
demonstrate  significant  improvement  in  the  estimation  of 
both  structure  and  motion  when  compared  to  the  conven¬ 
tional  techniques  based  on  reducing  least-mean-squared 
error  in  motion  alone. 

Aloimonos  and  Rigoutsos  [133]  have  developed  a  scheme 
for  computing  3-D  motion  parameters  from  a  sequence  of 
stereo  imagery  which  does  not  require  a  priori  establish¬ 
ment  of  correspondences.  The  features  extracted  from  the 
left  and  right  images  are  assumed  to  lie  on  a  planar  surface 
Z  =  pX  +  qY  +  c.  Perspective  imaging  geometry  is 
assumed.  The  image  planes  are  parallel  to  the  X-Y  plane.
The  parameters  p,  q,  and  c  are  acquired  by  solving  a  set  of 
linear  equations  in  which  the  coefficient  of  each  of  the 
unknowns  consists  of  a  function  of  a  sum  of  the  image  coor¬ 
dinates.  The  solution  of  the  linear  equation  provides  the 
structure  of  the  scene.  Applying  this  process  before  and 
after  the  planar  surface  undergoes  motion  allows  for  the 
estimation  of  the  motion  parameters.  The  method  devel¬ 
oped  was  not  as  robust  as  was  expected  and  was  modified 
by  including  a  third  camera.  The  performance  of  the  algorithm
in  the  presence  of  noise  is  described  in  [133].
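
Why a planar surface leads to linear equations can be seen from a short derivation. This is a sketch under assumed conventions and not the specific formulation of [133]: the focal length f, the image coordinates x = fX/Z and y = fY/Z, and a parallel-axis stereo pair with baseline b and disparity d = fb/Z are all assumptions introduced here.

```latex
\begin{align*}
Z &= pX + qY + c = \frac{Z}{f}\,(px + qy) + c
  \quad\Longrightarrow\quad
  \frac{1}{Z} = \frac{1}{c}\left(1 - \frac{px + qy}{f}\right), \\
d &= \frac{fb}{Z} = \frac{b}{c}\,\bigl(f - px - qy\bigr).
\end{align*}
```

Under these assumptions the disparity is linear in the image coordinates. Summed over all features, the left-hand side becomes the difference between the sum of the left-image and the sum of the right-image horizontal coordinates, a quantity that does not depend on how the features are paired. This suggests how sums of image coordinates can serve as coefficients without explicit correspondences.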

Another  technique  for  estimating  3-D  motion  parameters 
from  two  3-D  point  sets  without  establishing  correspondence
has  been  presented  by  Lin  et  al.  [134].  The  algorithm
is  based  on  the  property  that  a  function  and  its  Fourier 
transform  must  experience  the  same  rotation.  The  translation
is  first  determined  from  the  displacement  of  the  centroid.
Two  functions  are  defined  on  the  feature  set.  A  correlation
between  the  Fourier  transforms  of  these  functions
is  determined.  The  rotation  axis  and  angle  are  computed
based  on  this  procedure.  Some  simulation  results  have  been
presented  [134].

The  above  techniques  are  representative  of  the 
approaches  wherein  stereopsis  aids  the  recovery  of  motion. 
There  exist  many  reports  in  recent  literature  discussing  the 
use  of  motion  in  recovering  structure,  e.g.,  Jenkin  [135]  used
instantaneous  velocities  at  feature  points  to  aid  the  establishment
of  stereo  correspondence,  and  Nevatia  [136],  Mutch
[137],  Xu  et  al.  [138],  and  Jain  [139],  among  others,  used
known  motion  parameters  to  simulate  stereo.  We  feel  that 
although  this  approach  is  related  to  the  estimation  of 
motion,  it  is  a  separate  field  in  itself.  Hence,  we  do  not  pur¬ 
sue  any  further  the  discussion  of  the  use  of  known  motion 
to  aid  stereopsis,  and  we  limit  our  discussion  to  the  use  of 
stereoscopy  for  estimation  of  motion. 

B.  Multiple  Optic  Flow  Fields 

In  Section  IV  we  discussed  the  interpretation  of  optic  flow 
fields  obtained  from  a  sequence  of  monocular  images. 
Another  approach  has  been  to  compute  multiple  optic  flow 
fields  from  different  views,  to  establish  correspondence 
between  them  and  reconstruct  3-D  velocity  vector  fields. 

Mitiche  [140]  assumes  that  optic  flow  is  computed  for  each 
view  in  a  stereoscopic  imaging  system  for  which  the  ste¬ 
reoscopy  parameters  are  known.  He  further  assumes  that 
correspondences  between  points  in  the  two  images  are
available,  which  allows  for  the  estimation  of  depth.  Mitiche
shows  that  given  this  information  it  is  possible  to  compute
the  3-D  motion  parameters  in  a  straightforward  manner. 
Waxman  and  Sinha  [141]  have  used  a  similar  approach.  In 
addition,  they  have  filtered  the  optic  flow  field  to  minimize 
the  effects  of  noise.  Nagel  [142]  has  also  attempted  such 
stereo-motion  fusion  techniques  and  has  devised  an 
approach  based  on  the  minimization  of  an  error  function. 
Tsukune  and  Aggarwal  [98]  have  used  this  approach  for 
reconstructing  3-D  velocity  fields  for  a  scene  containing 
multiple  objects  in  motion. 
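
To illustrate how a 3-D velocity vector can follow from two flow fields plus correspondence, here is a minimal sketch. It is not the formulation of [140], [141], or [98]; it assumes a parallel-axis stereo rig with focal length f and baseline b, disparity d = x_l - x_r, and depth Z = f b / d, and the function and parameter names are hypothetical.

```python
import numpy as np

def velocity_from_stereo_flow(x_l, y, d, u_l, v_l, u_r, f, b):
    """3-D velocity of one point from left/right optic flow (parallel-axis rig).

    x_l, y   : left-image coordinates of the point
    d        : disparity x_l - x_r (must be non-zero)
    u_l, v_l : optic flow at the point in the left image
    u_r      : horizontal optic flow at the corresponding right-image point
    f, b     : focal length and baseline (assumed known)
    """
    Z = f * b / d                  # depth from disparity
    d_dot = u_l - u_r              # rate of change of disparity
    Z_dot = -Z * d_dot / d         # from differentiating Z = f * b / d
    # Differentiate X = x_l * Z / f and Y = y * Z / f with respect to time.
    X_dot = (u_l * Z + x_l * Z_dot) / f
    Y_dot = (v_l * Z + y * Z_dot) / f
    return np.array([X_dot, Y_dot, Z_dot])
```

Applying this at every matched pixel yields a 3-D velocity field. Because d appears in the denominator, errors in disparity and in the flow difference are amplified for distant points, which is consistent with the noise filtering mentioned above.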

Richards  [143]  demonstrated  that  the  relative  rate  of 
change  of  disparity  (ratio  between  temporal  rate  of  change 
of  disparity  and  disparity)  due  to  object/camera  motion  is 
a  useful  aid  in  establishing  feature  correspondence  within 
a  pair  of  stereo  images.  Waxman  and  Duncan  [144]  used  the
ratio  between  relative  flow  and  disparity  to  aid  the  estab¬ 
lishment  of  stereo  correspondence.  The  relative  flow  is 
defined  to  be  the  difference  between  the  optic  flow  at  a
point  in  the  left  image  and  that  at  the  corresponding  point 
in  the  right  image.  Waxman  and  Duncan  show  that  their 
ratio  is  identical  to  the  one  devised  by  Richards  [143]. 
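
The equivalence of the two ratios can be sketched in a few lines. The notation is an assumption introduced here: x_L and x_R are the horizontal image coordinates of corresponding points, and u_L and u_R are the horizontal optic flow components at those points. The last step additionally assumes a parallel-axis rig with focal length f and baseline b.

```latex
\begin{align*}
d &= x_L - x_R, \qquad \dot d = \dot x_L - \dot x_R = u_L - u_R, \\
\frac{\dot d}{d} &= \frac{u_L - u_R}{d}, \qquad
d = \frac{fb}{Z} \;\Longrightarrow\; \frac{\dot d}{d} = -\frac{\dot Z}{Z}.
\end{align*}
```

The temporal rate of change of disparity is thus the relative flow itself, so both ratios describe the same quantity. Under the parallel-axis assumption it equals the rate of approach normalized by depth.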

VII.  Conclusion 

In  this  paper  we  have  reviewed  recently  developed  tech¬ 
niques  for  estimating  structure  and  motion  from  sequences 
of  monocular  and  stereoscopic  images.  We  discussed  two 
distinct  approaches:  feature-based  analysis  and  optic  flow 
techniques.  We  described  some  of  the  different  mathe¬ 
matical  formulations  that  have  been  developed  for  each  of 
these  tasks.  A  comparison  of  the  feature-based  and  optic 
flow  methods  was  then  presented  in  which  the  relative  merits
and  demerits  of  both  approaches  were  discussed.  An
overview  of  the  fusion  of  stereoscopy  and  motion  analysis,
especially  for  aiding  the  estimation  of  motion,  was  pre¬ 
sented. 

The  optic  flow  approach  consists  of  computing  the  two- 
dimensional  field  of  instantaneous  velocities  of  brightness 
values  (gray  levels)  in  the  image  plane.  Instead  of  considering
temporal  changes  in  image  brightness  values  in  computing
the  optic  flow  field,  it  is  also  possible  to  consider
temporal  changes  in  values  that  are  the  result  of  applying
various  local  operators  such  as  contrast,  entropy,  and  spatial
derivatives  to  the  image  brightness  values.  In  either  case,
a  relatively  dense  flow  field  is  estimated,  usually  at  every 
pixel  in  the  image.  The  optic  flow  is  then  used  in  conjunc¬ 
tion  with  added  constraints  or  information  regarding  the 
scene  to  compute  the  actual  three-dimensional  relative 
velocities  between  scene  objects  and  camera. 

The  feature-based  approach  is  based  on  extracting  a
relatively  sparse  but  highly  discriminatory  set  of  two-
dimensional  features  in  the  images  corresponding  to  three- 
dimensional  object  features  in  the  scene  such  as  corners, 
occluding  boundaries  of  surfaces,  and  boundaries  demar¬ 
cating  changes  in  surface  reflectivity.  Such  points,  lines  and/ 
or  curves  are  extracted  from  each  image.  Inter-frame  cor¬ 
respondence  is  then  established  between  these  features. 
Constraints  are  formulated  based  on  assumptions  such  as 
rigid  body  motion,  e.g.,  the  3-D  distance  between  two  fea¬ 
tures  on  a  rigid  body  remains  the  same  after  object/camera 
motion.  Such  constraints  usually  result  in  a  system  of  nonlinear
equations.  The  observed  displacements  of  the  2-D
image  features  are  used  to  solve  these  equations  leading 
ultimately  to  the  computation  of  motion  parameters  of 
objects  in  the  scene. 

In  the  feature-based  approach,  the  main  problems 
encountered  are  seen  to  be:  1)  establishing  and  maintaining 
correspondence  between  the  image  plane  features,  2) 
robust  formulation  of  the  problem  which  is  usually  based 
on  the  assumption  that  the  viewed  object  undergoes  rigid 
motion,  and  3)  developing  appropriate  iterative  algorithms 
which  are  stable  and  accurate.  The  optic  flow  based 
approach  suffers  from  a  different  set  of  drawbacks,  i.e.,  1) 
it  is  highly  noise  sensitive  due  to  its  dependence  on  spatio- 
temporal  gradients,  2)  it  requires  that  motion  be  smooth 
and  small  thus  requiring  a  high  rate  of  image  acquisition, 
and  3)  it  requires  that  motion  vary  continuously  over  the 
image.  Both  approaches  also  are  affected  by  object  occlu¬ 
sion  and  choice  of  initial/boundary  conditions.  The  use  of 
sequences  of  stereoscopic  images  provides  three-dimen¬ 
sional  points  and  lines  which  somewhat  simplify  the  prob¬ 
lem  of  estimating  motion. 

A  great  deal  of  future  research  effort  is  warranted  to  overcome
the  obstacles  mentioned  above.  The  significant  contributions
made  by  various  researchers  in  this  area  during
the  recent  past  are  to  be  noted  and  this  trend  may  be  expected
to  continue  in  the  future.  Two  workshops,  one  in  Europe 
[145]  and  one  in  the  USA  [146],  are  planned  in  the  near  future
to  engender  progress  in  this  challenging  area. 